Computation and Language
☆ ALTA: Compiler-Based Analysis of Transformers
We propose a new programming language called ALTA and a compiler that can map
ALTA programs to Transformer weights. ALTA is inspired by RASP, a language
proposed by Weiss et al. (2021), and Tracr (Lindner et al., 2023), a compiler
from RASP programs to Transformer weights. ALTA complements and extends this
prior work, offering the ability to express loops and to compile programs to
Universal Transformers, among other advantages. ALTA allows us to
constructively show how Transformers can represent length-invariant algorithms
for computing parity and addition, as well as a solution to the SCAN benchmark
of compositional generalization tasks, without requiring intermediate
scratchpad decoding steps. We also propose tools to analyze cases where the
expressibility of an algorithm is established, but end-to-end training on a
given training set fails to induce behavior consistent with the desired
algorithm. To this end, we explore training from ALTA execution traces as a
more fine-grained supervision signal. This enables additional experiments and
theoretical analyses relating the learnability of various algorithms to data
availability and modeling decisions, such as positional encodings. We make the
ALTA framework -- language specification, symbolic interpreter, and weight
compiler -- available to the community to enable further applications and
insights.
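The length-invariant parity algorithm referenced above can be sketched in plain Python rather than ALTA (whose syntax is not reproduced here): a single state update applied once per token, the pattern a Universal Transformer realizes by reusing one layer across loop iterations.

```python
def parity(bits):
    """Running parity of a bit sequence, updated one token per loop
    iteration -- the length-invariant pattern a Universal Transformer
    can realize by applying the same layer repeatedly."""
    state = 0
    for b in bits:
        state ^= b  # flip the state on every 1
    return state
```

Because the per-step update is independent of sequence length, the same compiled weights handle inputs of any length without scratchpad decoding.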
☆ TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Recently, multimodal large language models (MLLMs) have received much
attention for their impressive capabilities. The evaluation of MLLMs is
becoming critical to analyzing attributes of MLLMs and providing valuable
insights. However, current benchmarks overlook the problem of prompt
sensitivity - minor prompt variations may lead to significant performance
fluctuations. Thus, inappropriate prompts may obscure a model's capabilities
and lead to underestimates of its performance. Moreover, different models have
different preferences for different prompts, and thus, using the same prompt
for all models will cause evaluation bias. This paper analyzes this deficiency
in existing benchmarks and further introduces a new evaluation framework named
TP-Eval, which introduces a prompt customization method to reduce evaluation
biases and tap models' potential. TP-Eval will rewrite the original prompts to
different customized prompts for different models. In particular, we propose
some well-designed modules for prompt customization tailored to the scenario of
MLLM evaluation. Extensive experiments demonstrate the effectiveness of our
approach to uncovering models' capabilities, and TP-Eval should benefit the
community in developing more comprehensive and convincing MLLM evaluation
benchmarks.
☆ CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov, Dmitrii Korzh, Alexey Zhavoronkin, Boris Mikheev, Denis Bobkov, Aibek Alanov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
Machine Unlearning (MU) is critical for enhancing privacy and security in
deep learning models, particularly in large multimodal language models (MLLMs),
by removing specific private or hazardous information. While MU has made
significant progress in textual and visual modalities, multimodal unlearning
(MMU) remains significantly underexplored, partially due to the absence of a
suitable open-source benchmark. To address this, we introduce CLEAR, a new
benchmark designed to evaluate MMU methods. CLEAR contains 200 fictitious
individuals and 3,700 images linked with corresponding question-answer pairs,
enabling a thorough evaluation across modalities. We assess 10 MU methods,
adapting them for MMU, and highlight new challenges specific to multimodal
forgetting. We also demonstrate that simple $\ell_1$ regularization on LoRA
weights significantly mitigates catastrophic forgetting, preserving model
performance on retained data. The dataset is available at
https://huggingface.co/datasets/therem/CLEAR
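The $\ell_1$ regularization on LoRA weights mentioned in the abstract can be sketched as follows; this is an illustrative penalty term with hypothetical names (`lora_params`, `lam`), not the paper's exact loss.

```python
def l1_penalty(lora_params, lam=1e-4):
    """Sum of absolute values over all LoRA weight matrices, scaled by lam.
    Added to the unlearning loss, it pushes the adapter update toward
    sparsity, limiting how much of the model the edit perturbs -- the
    mechanism the abstract credits with mitigating catastrophic forgetting."""
    return lam * sum(abs(w) for matrix in lora_params
                     for row in matrix for w in row)

# total_loss = unlearning_loss + l1_penalty(lora_params)  # hypothetical usage
```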
☆ LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering EMNLP 2024
Long-Context Question Answering (LCQA), a challenging task, aims to reason
over long-context documents to yield accurate answers to questions. Existing
long-context Large Language Models (LLMs) for LCQA often struggle with the
"lost in the middle" issue. Retrieval-Augmented Generation (RAG) mitigates this
issue by providing external factual evidence. However, its chunking strategy
disrupts the global long-context information, and its low-quality retrieval in
long contexts hinders LLMs from identifying effective factual details due to
substantial noise. To address these issues, we propose LongRAG, a general,
dual-perspective, and robust LLM-based RAG system paradigm for LCQA to enhance
RAG's understanding of complex long-context knowledge (i.e., global information
and factual details). We design LongRAG as a plug-and-play paradigm,
facilitating adaptation to various domains and LLMs. Extensive experiments on
three multi-hop datasets demonstrate that LongRAG significantly outperforms
long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG
(up by 17.25%). Furthermore, we conduct quantitative ablation studies and
multi-dimensional analyses, highlighting the effectiveness of the system's
components and fine-tuning strategies. Data and code are available at
https://github.com/QingFei1/LongRAG.
comment: EMNLP 2024 Main
☆ Key Algorithms for Keyphrase Generation: Instruction-Based LLMs for Russian Scientific Keyphrases
Keyphrase selection is a challenging task in natural language processing that
has a wide range of applications. Adapting existing supervised and unsupervised
solutions for the Russian language faces several limitations due to the rich
morphology of Russian and the limited number of training datasets available.
Recent studies conducted on English texts show that large language models
(LLMs) successfully address the task of generating keyphrases. LLMs can
achieve impressive results without task-specific fine-tuning, using text
prompts instead. In this work, we assess the performance of prompt-based
methods for generating keyphrases for Russian scientific abstracts. First, we
compare the performance of zero-shot and few-shot prompt-based methods,
fine-tuned models, and unsupervised methods. Then we assess strategies for
selecting keyphrase examples in a few-shot setting. We present the outcomes of
human evaluation of the generated keyphrases and analyze the strengths and
weaknesses of the models through expert assessment. Our results suggest that
prompt-based methods can outperform common baselines even using simple text
prompts.
comment: The 12th International Conference on Analysis of Images, Social
Networks and Texts (AIST'2024)
☆ MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning EMNLP 2024
Low-rank adaptation (LoRA) and its mixture-of-experts (MOE) variants are
highly effective parameter-efficient fine-tuning (PEFT) methods. However, they
introduce significant latency in multi-tenant settings due to the LoRA modules
and MOE routers added to multiple linear modules in the Transformer layer. To
address this issue, we propose Mixture of Low-Rank Adaptation (MiLoRA), a novel
and efficient LoRA variant. MiLoRA differs from previous MOE-style LoRA methods
by considering each LoRA module as an expert and employing a prompt-aware
routing mechanism. This mechanism calculates expert routing results once before
generating the first new token and reuses these results for subsequent tokens,
reducing latency. Extensive experiments and analysis on commonsense reasoning
tasks, math reasoning tasks, and widely used LLM evaluation benchmarks
demonstrate that MiLoRA consistently outperforms strong PEFT baselines with
comparable tunable parameter budgets. Additionally, MiLoRA significantly
reduces latency in multi-tenant settings compared to previous LoRA-based
methods.
comment: Accepted by EMNLP 2024 Findings. arXiv admin note: substantial text
overlap with arXiv:2405.18203
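MiLoRA's prompt-aware routing, which computes expert routing once from the prompt and reuses it for all decoded tokens, might look roughly like this toy sketch; the mean-pooled linear router and all names here are assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def route_once(prompt_vecs, expert_weights):
    """Score experts from the mean prompt representation a single time.
    The returned distribution would be cached and reused for every
    subsequently decoded token, avoiding per-token router calls and
    the latency they add in multi-tenant serving."""
    dim = len(prompt_vecs[0])
    mean = [sum(v[i] for v in prompt_vecs) / len(prompt_vecs)
            for i in range(dim)]
    scores = [sum(w[i] * mean[i] for i in range(dim))
              for w in expert_weights]
    return softmax(scores)
```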
☆ GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration
Xin Li, Qizhi Chu, Yubin Chen, Yang Liu, Yaoqi Liu, Zekai Yu, Weize Chen, Chen Qian, Chuan Shi, Cheng Yang
Graphs are widely used for modeling relational data in real-world scenarios,
such as social networks and urban computing. Existing LLM-based graph analysis
approaches either integrate graph neural networks (GNNs) for specific machine
learning tasks, limiting their transferability, or rely solely on LLMs'
internal reasoning ability, resulting in suboptimal performance. To address
these limitations, we take advantage of recent advances in LLM-based agents,
which have shown capabilities of utilizing external knowledge or tools for
problem solving. By simulating human problem-solving strategies such as analogy
and collaboration, we propose a multi-agent system based on LLMs named
GraphTeam, for graph analysis. GraphTeam consists of five LLM-based agents from
three modules, and the agents with different specialities can collaborate with
each other to address complex problems. Specifically, (1) input-output
normalization module: the question agent extracts and refines four key
arguments from the original question, facilitating the problem understanding,
and the answer agent organizes the results to meet the output requirement; (2)
external knowledge retrieval module: we first build a knowledge base consisting
of relevant documentation and experience information, and then the search agent
retrieves the most relevant entries for each question; (3) problem-solving
module: given the information retrieved by the search agent, the coding agent
uses established algorithms via programming to generate solutions, and in case
the coding agent does not work, the reasoning agent will directly compute the
results without programming. Extensive experiments on six graph analysis
benchmarks demonstrate that GraphTeam achieves state-of-the-art performance
with an average 25.85% improvement over the best baseline in terms of accuracy.
The code and data are available at https://github.com/BUPT-GAMMA/GraphTeam.
☆ Cross-lingual Transfer of Reward Models in Multilingual Alignment
Reinforcement learning with human feedback (RLHF) is shown to largely benefit
from precise reward models (RMs). However, recent studies in reward modeling
schemes are skewed towards English, limiting the applicability of RLHF in
multilingual alignment. In this work, we investigate the cross-lingual
transfer of RMs trained in diverse languages, primarily from English. Our
experimental results demonstrate the strong cross-lingual transfer of English
RMs, which exceed target-language RMs by an average of 3-4% on Multilingual
RewardBench. Furthermore, we analyze the cross-lingual transfer of RMs through
the representation shifts. Finally, we perform multilingual alignment to
exemplify how cross-lingual transfer in RM propagates to enhanced multilingual
instruction-following capability, along with extensive analyses on
off-the-shelf RMs. We release the code, model, and data.
☆ Together We Can: Multilingual Automatic Post-Editing for Low-Resource Languages EMNLP 2024
This exploratory study investigates the potential of multilingual Automatic
Post-Editing (APE) systems to enhance the quality of machine translations for
low-resource Indo-Aryan languages. Focusing on two closely related language
pairs, English-Marathi and English-Hindi, we exploit the linguistic
similarities to develop a robust multilingual APE model. To facilitate
cross-linguistic transfer, we generate synthetic Hindi-Marathi and
Marathi-Hindi APE triplets. Additionally, we incorporate a Quality Estimation
(QE)-APE multi-task learning framework. While the experimental results
underline the complementary nature of APE and QE, we also observe that QE-APE
multi-task learning facilitates effective domain adaptation. Our experiments
demonstrate that the multilingual APE models outperform their corresponding
English-Hindi and English-Marathi single-pair models by $2.5$ and $2.39$ TER
points, respectively, with further notable improvements over the multilingual
APE model observed through multi-task learning ($+1.29$ and $+1.44$ TER
points), data augmentation ($+0.53$ and $+0.45$ TER points) and domain
adaptation ($+0.35$ and $+0.45$ TER points). We release the synthetic data,
code, and models accrued during this study publicly at
https://github.com/cfiltnlp/Multilingual-APE.
comment: Accepted at Findings of EMNLP 2024
☆ Dependency Graph Parsing as Sequence Labeling EMNLP-2024
Various linearizations have been proposed to cast syntactic dependency
parsing as sequence labeling. However, these approaches do not support more
complex graph-based representations, such as semantic dependencies or enhanced
universal dependencies, as they cannot handle reentrancy or cycles. By
extending them, we define a range of unbounded and bounded linearizations that
can be used to cast graph parsing as a tagging task, enlarging the toolbox of
problems that can be solved under this paradigm. Experimental results on
semantic dependency and enhanced UD parsing show that with a good choice of
encoding, sequence-labeling dependency graph parsers combine high efficiency
with accuracies close to the state of the art, in spite of their simplicity.
comment: Accepted at EMNLP-2024
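One illustrative graph linearization (not necessarily one of the paper's encodings) assigns each token a label holding its set of (head offset, relation) pairs, which accommodates reentrancy because a token may carry several heads.

```python
def encode(edges, n):
    """edges: (head, dependent, relation) triples over 0-indexed tokens.
    Returns one label per token: a tuple of (relative head offset,
    relation) pairs. Several pairs on one token encode reentrancy,
    which plain head-per-token linearizations cannot express."""
    labels = [[] for _ in range(n)]
    for head, dep, rel in edges:
        labels[dep].append((head - dep, rel))
    return [tuple(sorted(label)) for label in labels]
```

A tagger then predicts one such label per token, turning graph parsing into sequence labeling.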
☆ A Time-Aware Approach to Early Detection of Anorexia: UNSL at eRisk 2024
The eRisk laboratory aims to address issues related to early risk detection
on the Web. In this year's edition, three tasks were proposed, where Task 2 was
about early detection of signs of anorexia. Early risk detection is a problem
where precision and speed are two crucial objectives. Our research group solved
Task 2 by defining a CPI+DMC approach, addressing both objectives
independently, and a time-aware approach, where precision and speed are
considered as a single combined objective. We implemented the latter approach by
explicitly integrating time during the learning process, using the
ERDE$_\theta$ metric as the training objective. This also allowed us to
incorporate temporal metrics to validate and select the optimal models. We
achieved outstanding results for the ERDE$_{50}$ metric and ranking-based metrics,
demonstrating consistency in solving ERD problems.
comment: In Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble,
France
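The ERDE metric used as the training objective is commonly defined in the eRisk literature as follows; the cost constants here are illustrative defaults, not necessarily those used by the team.

```python
import math

def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Early Risk Detection Error for one subject. A true positive is
    penalized by a latency cost that grows with the number of posts k
    read before deciding; o is the metric's deadline parameter (e.g.
    50 for ERDE_50). False positives and false negatives incur fixed
    costs; true negatives cost nothing."""
    if decision and not truth:
        return c_fp
    if not decision and truth:
        return c_fn
    if decision and truth:
        latency_cost = 1.0 - 1.0 / (1.0 + math.exp(k - o))
        return latency_cost * c_tp
    return 0.0
```

Training against this objective rewards decisions that are both correct and early, which is why it can be used to combine precision and speed in a single objective.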
☆ Zeitenwenden: Detecting changes in the German political discourse
From a monarchy to a democracy, to a dictatorship and back to a democracy --
the German political landscape has been constantly changing ever since the
first German national state was formed in 1871. After World War II, the Federal
Republic of Germany was formed in 1949. Since then, every plenary session of the
German Bundestag has been logged, and the records have been digitized over the course of the
last few years. We analyze these texts using a time series variant of the topic
model LDA to investigate which events had a lasting effect on the political
discourse and how the political topics changed over time. This allows us to
detect changes in word frequency (and thus key discussion points) in political
discourse.
comment: 7 pages, 6 figures
☆ ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon
Sparse Mixture of Experts (MoE) models, while outperforming dense Large
Language Models (LLMs), face significant deployment
challenges during inference due to their high memory demands. Existing
offloading techniques, which involve swapping activated and idle experts
between the GPU and CPU, often suffer from rigid expert caching mechanisms.
These mechanisms fail to adapt to dynamic routing, leading to inefficient cache
utilization, or incur prohibitive costs for prediction training. To tackle
these inference-specific challenges, we introduce ExpertFlow, a comprehensive
system specifically designed to enhance inference efficiency by accommodating
flexible routing and enabling efficient expert scheduling between CPU and GPU.
This reduces overhead and boosts system performance. Central to our approach is
a predictive routing path-based offloading mechanism that utilizes a
lightweight predictor to accurately forecast routing paths before computation
begins. This proactive strategy allows for real-time error correction in expert
caching, significantly increasing cache hit ratios and reducing the frequency
of expert transfers, thereby minimizing I/O overhead. Additionally, we
implement a dynamic token scheduling strategy that optimizes MoE inference by
rearranging input tokens across different batches. This method not only reduces
the number of activated experts per batch but also improves computational
efficiency. Our extensive experiments demonstrate that ExpertFlow achieves up
to 93.72\% GPU memory savings and enhances inference speed by 2 to 10 times
compared to baseline methods, highlighting its effectiveness and utility as a
robust solution for resource-constrained inference scenarios.
comment: Mixture-of-Experts, Inference, Offloading
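The cache-hit ratio that ExpertFlow's predictive offloading tries to maximize can be sketched with a toy prefetch model; the function and parameter names are hypothetical, and the real system corrects predictions in real time rather than prefetching once.

```python
def cache_hit_ratio(predicted_path, actual_path, cache_size):
    """Load the first cache_size experts a lightweight predictor expects
    on the routing path, then count how many actually-routed experts were
    already resident on the GPU. Higher hit ratios mean fewer CPU-GPU
    expert transfers and hence less I/O overhead."""
    cache = set(predicted_path[:cache_size])
    hits = sum(1 for expert in actual_path if expert in cache)
    return hits / len(actual_path)
```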
☆ SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He
Retrieval-augmented generation (RAG) enhances the question-answering (QA)
abilities of large language models (LLMs) by integrating external knowledge.
However, adapting general-purpose RAG systems to specialized fields such as
science and medicine poses unique challenges due to distribution shifts and
limited access to domain-specific data. To tackle this, we propose SimRAG, a
self-training approach that equips the LLM with joint capabilities of question
answering and question generation for domain adaptation. Our method first
fine-tunes the LLM on instruction-following, question-answering, and
search-related data. Then, it prompts the same LLM to generate diverse
domain-relevant questions from unlabeled corpora, with an additional filtering
strategy to retain high-quality synthetic examples. By leveraging these
synthetic examples, the LLM can improve its performance on domain-specific
RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three
domains, demonstrate that SimRAG outperforms baselines by 1.2\%--8.6\%.
comment: Work in Progress
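The abstract's filtering of synthetic examples could, for instance, take the form of a round-trip consistency check; this is one plausible sketch, not necessarily SimRAG's actual strategy.

```python
def filter_synthetic(pairs, answer_fn):
    """Keep only synthetic (question, answer) pairs that survive a
    round-trip check: the model must reproduce the answer when asked
    its own generated question. One simple notion of 'high quality';
    the paper's actual filter may differ."""
    return [(q, a) for q, a in pairs if answer_fn(q) == a]
```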
☆ ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams
Recent advancements in Text-to-Speech (TTS) technology have led to
natural-sounding speech for English, primarily due to the availability of
large-scale, high-quality web data. However, many other languages lack access
to such resources, relying instead on limited studio-quality data. This
scarcity results in synthesized speech that often suffers from intelligibility
issues, particularly with low-frequency character bigrams. In this paper, we
propose three solutions to address this challenge. First, we leverage
high-quality data from linguistically or geographically related languages to
improve TTS for the target language. Second, we utilize low-quality Automatic
Speech Recognition (ASR) data recorded in non-studio environments, which is
refined using denoising and speech enhancement models. Third, we apply
knowledge distillation from large-scale models using synthetic data to generate
more robust outputs. Our experiments with Hindi demonstrate significant
reductions in intelligibility issues, as validated by human evaluators. We
propose this methodology as a viable alternative for languages with limited
access to high-quality data, enabling them to collectively benefit from shared
resources.
comment: 11 pages, 1 figure, 3 tables
☆ Value Residual Learning For Alleviating Attention Concentration In Transformers
Transformers can capture long-range dependencies using self-attention,
allowing tokens to attend to all others directly. However, stacking multiple
attention layers leads to attention concentration. One natural way to address
this issue is to use cross-layer attention, allowing information from earlier
layers to be directly accessible to later layers. However, this approach is
computationally expensive. To address this problem, we propose Transformer with
residual value (ResFormer) which approximates cross-layer attention through
adding a residual connection from the values of the first layer to all
subsequent layers. Based on this method, one variant is the Transformer with
single-layer value (SVFormer), where all layers share the same value embedding
from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical
evidence demonstrates that ResFormer mitigates the attention concentration problem
in deeper layers and enhances representation across most layers, outperforming
the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as
downstream tasks. SVFormer trains significantly faster than the vanilla
Transformer and performs better than other methods like GQA and CLA, with
performance influenced by sequence length and cumulative learning rate.
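The value-residual idea can be sketched as single-head attention over values augmented with the first layer's values; a minimal pure-Python illustration, not the paper's implementation.

```python
import math

def attn_value_residual(q, k, v_layer, v_first):
    """Single-head attention whose value stream carries a residual from
    the first layer (ResFormer): softmax(QK^T / sqrt(d)) @ (V_l + V_1).
    SVFormer is the special case V_l = 0, where every layer reuses V_1,
    which is what permits the ~50% KV-cache reduction."""
    d = len(q[0])
    # Add the first-layer value residual once, outside the query loop.
    v = [[a + b for a, b in zip(rl, rf)] for rl, rf in zip(v_layer, v_first)]
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * row[j] for w, row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```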
☆ Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, Lingpeng Kong
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for
text generative modeling, potentially addressing limitations of autoregressive
(AR) models. However, current DLMs have been studied at a smaller scale
compared to their AR counterparts and lack fair comparison on language modeling
benchmarks. Additionally, training diffusion models from scratch at scale
remains challenging. Given the prevalence of open-source AR language models, we
propose adapting these models to build text diffusion models. We demonstrate
connections between AR and diffusion modeling objectives and introduce a simple
continual pre-training approach for training diffusion models. Through
systematic evaluation on language modeling, reasoning, and commonsense
benchmarks, we show that we can convert AR models ranging from 127M to 7B
parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA,
using less than 200B tokens for training. Our experimental results reveal that
these models outperform earlier DLMs and are competitive with their AR
counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters)
capable of generating fluent text, performing in-context learning, filling in
the middle without prompt re-ordering, and following instructions:
\url{https://github.com/HKUNLP/DiffuLLaMA}.
comment: 25 pages. Code: https://github.com/HKUNLP/DiffuLLaMA
☆ SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments
The application of natural language processing to political texts and
speeches has become increasingly relevant in political science, as it enables
the analysis of large text corpora that cannot be read by a single person.
But such text corpora often lack critical meta information, detailing for
instance the party, age or constituency of the speaker, that can be used to
provide an analysis tailored to more fine-grained research questions. To enable
researchers to answer such questions with quantitative approaches such as
natural language processing, we provide the SpeakGer data set, consisting of
German parliament debates from all 16 federal states of Germany as well as the
German Bundestag from 1947 to 2023, split into a total of 10,806,105 speeches.
This data set includes rich metadata covering both audience reactions to each
speech and information about the speaker's party, age, constituency, and the
party's political alignment, which enables deeper analysis. We further provide
three exploratory analyses: topic shares of different parties over time, a
descriptive analysis of how the average speaker's age has developed, and a
sentiment analysis of the speeches of different parties with regard to the
COVID-19 pandemic.
comment: 10 pages, 3 figures
☆ Understanding Layer Significance in LLM Alignment
Aligning large language models (LLMs) through fine-tuning is essential for
tailoring them to specific applications. Therefore, understanding what LLMs
learn during the alignment process is crucial. Recent studies suggest that
alignment primarily adjusts a model's presentation style rather than its
foundational knowledge, indicating that only certain components of the model
are significantly impacted. To delve deeper into LLM alignment, we propose to
identify which layers within LLMs are most critical to the alignment process,
thereby uncovering how alignment influences model behavior at a granular level.
We propose a novel approach to identify the important layers for LLM alignment
(ILA). It involves learning a binary mask for each incremental weight matrix in
the LoRA algorithm, indicating the significance of each layer. ILA consistently
identifies important layers across various alignment datasets, with nearly 90%
overlap even with substantial dataset differences, highlighting fundamental
patterns in LLM alignment. Experimental results indicate that freezing
non-essential layers improves overall model performance, while selectively
tuning the most critical layers significantly enhances fine-tuning efficiency
with minimal performance loss.
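The layer-importance masks ILA produces can be caricatured with two toy helpers: thresholding gate scores into a binary mask (the real method learns the binary mask jointly with the LoRA incremental weights) and computing the cross-dataset overlap statistic the abstract reports; all names here are illustrative.

```python
def select_layers(gate_scores, threshold=0.5):
    """Binarize per-layer gate scores into an ILA-style mask: layers
    with a gate above threshold are deemed alignment-critical and kept
    trainable; the rest can be frozen."""
    return [s > threshold for s in gate_scores]

def mask_overlap(mask_a, mask_b):
    """Fraction of layers on which two importance masks agree -- the
    statistic behind the abstract's ~90% cross-dataset overlap."""
    return sum(a == b for a, b in zip(mask_a, mask_b)) / len(mask_a)
```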
☆ Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination
Tree of Thoughts (ToT) is a reasoning strategy for Large Language Models
(LLMs) that employs a generator to suggest reasoning steps and a discriminator
to decide which steps to implement. ToT demonstrates strong performance on
reasoning tasks, often surpassing simple methods such as Input-Output (IO)
prompting and Chain-of-Thought (CoT) reasoning. However, ToT does not
consistently outperform such simpler methods across all models, leaving large
gaps in our knowledge of the conditions under which ToT is most beneficial. In this
paper, we analyze the roles of the generator and discriminator separately to
better understand the conditions when ToT is beneficial. We find that the
generator plays a more critical role than the discriminator in driving the
success of ToT. Even with a smaller model as the discriminator, scaling
the generator leads to notable improvements in ToT performance, whereas scaling
the discriminator with a fixed generator yields only marginal gains. Our
results show that models across different scales exhibit comparable
discrimination capabilities, yet differ significantly in their generative
performance for ToT.
comment: Code: github.com/mainlp/tot-eval
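A single ToT expansion step, with the generator and discriminator kept as separate plug-in functions so that each can be scaled independently (the abstract's finding being that scaling the generator matters far more), can be sketched as:

```python
def tot_step(states, generate, score, beam=2):
    """One Tree-of-Thoughts expansion: the generator proposes candidate
    next steps for each current state, the discriminator scores every
    candidate, and only the best `beam` candidates survive to the next
    round of expansion."""
    candidates = [c for s in states for c in generate(s)]
    return sorted(candidates, key=score, reverse=True)[:beam]
```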
☆ OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan
Full-duplex spoken dialogue systems are a significant advance over traditional
turn-based dialogue systems, as they allow simultaneous bidirectional
communication, closely mirroring human-human interactions. However, achieving
low latency and natural interactions in full-duplex dialogue systems remains a
significant challenge, especially considering human conversation dynamics such
as interruptions, backchannels, and overlapping speech. In this paper, we
introduce a novel end-to-end GPT-based model, OmniFlatten, for full-duplex
conversation, capable of effectively modeling the complex behaviors inherent to
natural conversations with low latency. To achieve full-duplex communication
capabilities, we propose a multi-stage post-training scheme that progressively
adapts a text-based large language model (LLM) backbone into a speech-text
dialogue LLM, capable of generating text and speech in real time, without
modifying the architecture of the backbone LLM. The training process comprises
three stages: modality alignment, half-duplex dialogue learning, and
full-duplex dialogue learning. Throughout all training stages, we standardize
the data using a flattening operation, which allows us to unify the training
methods and the model architecture across different modalities and tasks. Our
approach offers a straightforward modeling technique and a promising research
direction for developing efficient and natural end-to-end full-duplex spoken
dialogue systems. Audio samples of dialogues generated by OmniFlatten can be
found at this web site (https://omniflatten.github.io/).
comment: Work in progress
☆ Leveraging the Domain Adaptation of Retrieval Augmented Generation Models for Question Answering and Reducing Hallucination
While ongoing advancements in Large Language Models have demonstrated
remarkable success across various NLP tasks, the Retrieval-Augmented Generation
model stands out as highly effective for downstream applications like
Question Answering. Recently, the RAG-end2end model further optimized the
architecture and achieved notable performance improvements on domain
adaptation. However, the effectiveness of these RAG-based architectures remains
relatively unexplored when fine-tuned on specialized domains such as customer
service for building a reliable conversational AI system. Furthermore, a
critical challenge persists in reducing the occurrence of hallucinations while
maintaining high domain-specific accuracy. In this paper, we investigated the
performance of diverse RAG and RAG-like architectures through domain adaptation
and evaluated their ability to generate accurate and relevant responses grounded
in the contextual knowledge base. To facilitate the evaluation of the models,
we constructed a novel dataset, HotelConvQA, sourced from a wide range of
hotel-related conversations, and fine-tuned all the models on our
domain-specific dataset. We also addressed a critical research gap: determining the
impact of domain adaptation on reducing hallucinations across different RAG
architectures, an aspect that was not properly measured in prior work. Our
evaluation shows positive results in all metrics by employing domain
adaptation, demonstrating strong performance on QA tasks and providing insights
into their efficacy in reducing hallucinations. Our findings clearly indicate
that domain adaptation not only enhances the models' performance on QA tasks
but also significantly reduces hallucination across all evaluated RAG
architectures.
comment: Initial Version fine-tuned on HotelConvQA
☆ Latent Structures of Intertextuality in French Fiction
Intertextuality is a key concept in literary theory that challenges
traditional notions of text, signification or authorship. It views texts as
part of a vast intertextual network that is constantly evolving and being
reconfigured. This paper argues that the field of computational literary
studies is the ideal place to conduct a study of intertextuality, since we now
have the ability to systematically compare texts with each other. Specifically,
we present a study of a corpus of more than 12,000 French fictions from the
18th, 19th, and early 20th centuries. We focus on evaluating the underlying roles
of two literary notions, sub-genres and the literary canon in the framing of
textuality. The article attempts to operationalize intertextuality using
state-of-the-art contextual language models to encode novels and capture
features that go beyond simple lexical or thematic approaches. Previous
research (Hughes, 2012) supports the existence of a literary "style of a time",
and our findings further reinforce this concept. They also suggest that
both subgenres and canonicity play a significant role in shaping textual
similarities within French fiction. These discoveries point to the importance
of considering genre and canon as dynamic forces that influence the evolution
and intertextual connections of literary works within specific historical
contexts.
comment: 13 pages, 6 figures. Computational Humanities Research Conference
2024
☆ Local Contrastive Editing of Gender Stereotypes EMNLP 2024
Stereotypical bias encoded in language models (LMs) poses a threat to safe
language technology, yet our understanding of how bias manifests in the
parameters of LMs remains incomplete. We introduce local contrastive editing
that enables the localization and editing of a subset of weights in a target
model in relation to a reference model. We deploy this approach to identify and
modify subsets of weights that are associated with gender stereotypes in LMs.
Through a series of experiments, we demonstrate that local contrastive editing
can precisely localize and control a small subset (< 0.5%) of weights that
encode gender bias. Our work (i) advances our understanding of how
stereotypical biases can manifest in the parameter space of LMs and (ii) opens
up new avenues for developing parameter-efficient strategies for controlling
model properties in a contrastive manner.
comment: Accepted at EMNLP 2024
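The core idea of localizing a small weight subset relative to a reference model can be sketched as follows. This is an illustrative simplification with hypothetical names (`local_contrastive_edit`, the global-threshold selection), not the paper's exact procedure: weights whose target-vs-reference difference is largest are assumed to carry the contrast, and only those are overwritten.

```python
import numpy as np

def local_contrastive_edit(target, reference, fraction=0.005):
    """Sketch: localize the ~fraction of weights that differ most between a
    target and a reference model, then move only those back to the reference."""
    # Stack absolute differences across all parameter tensors to pick
    # one global magnitude threshold.
    diffs = np.concatenate(
        [np.abs(target[k] - reference[k]).ravel() for k in target]
    )
    k = max(1, int(round(fraction * diffs.size)))
    threshold = np.partition(diffs, -k)[-k]  # magnitude of the k-th largest diff
    edited = {}
    for name, w in target.items():
        mask = np.abs(w - reference[name]) >= threshold
        # Overwrite only the localized subset; everything else is untouched.
        edited[name] = np.where(mask, reference[name], w)
    return edited
```

With `fraction=0.005` this touches the same order of magnitude of weights (< 0.5%) as the paper reports.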
☆ MojoBench: Language Modeling and Benchmarks for Mojo
The recently introduced Mojo programming language (PL) by Modular has
received significant attention in the scientific community due to its claimed
substantial speed boost over Python. Despite advancements in code Large
Language Models (LLMs) across various PLs, Mojo remains unexplored in this
context. To address this gap, we introduce MojoBench, the first framework for
Mojo code generation. MojoBench includes HumanEval-Mojo, a benchmark dataset
designed for evaluating code LLMs on Mojo, and Mojo-Coder, the first LLM
pretrained and finetuned for Mojo code generation, which supports instructions
in 5 natural languages (NLs). Our results show that Mojo-Coder achieves a
30-35% performance improvement over leading models like GPT-4o and
Claude-3.5-Sonnet. Furthermore, we provide insights into LLM behavior with
underrepresented and unseen PLs, offering potential strategies for enhancing
model adaptability. MojoBench contributes to our understanding of LLM
capabilities and limitations in emerging programming paradigms, fostering more
robust code generation systems.
☆ Dialectal and Low Resource Machine Translation for Aromanian COLING 2025
We present a neural machine translation system that can translate between
Romanian, English, and Aromanian (an endangered Eastern Romance language); the
first of its kind. BLEU scores range from 17 to 32 depending on the direction
and genre of the text. Alongside, we release the biggest known
Aromanian-Romanian bilingual corpus, consisting of 79k cleaned sentence pairs.
Additional tools such as an agnostic sentence embedder (used for both text
mining and automatic evaluation) and a diacritics converter are also presented.
We publicly release our findings and models. Finally, we describe the
deployment of our quantized model at https://arotranslate.com.
comment: 16 pages, 3 figures, 6 tables, submitted to COLING 2025
☆ CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models
Despite their impressive capabilities, large language models (LLMs) often
lack interpretability and can generate toxic content. While using LLMs as
foundation models and applying semantic steering methods are widely practiced,
we believe that efficient methods should be based on a thorough understanding
of LLM behavior. To this end, we propose using eye movement measures to
interpret LLM behavior across layers. We find that LLMs exhibit patterns
similar to human gaze across layers and different layers function differently.
Inspired by these findings, we introduce a heuristic steering layer selection
and apply it to layer intervention methods via fine-tuning and inference. Using
language toxification and detoxification as test beds, we demonstrate that our
proposed CogSteer methods achieve better results in terms of toxicity scores
while efficiently saving 97% of the computational resources and 60% of the
training time. Our model-agnostic approach can be adopted into various LLMs,
contributing to their interpretability and promoting trustworthiness for safe
deployment.
☆ Beware of Calibration Data for Pruning Large Language Models
As large language models (LLMs) are widely applied across various fields,
model compression has become increasingly crucial for reducing costs and
improving inference efficiency. Post-training pruning is a promising method
that does not require resource-intensive iterative training and only needs a
small amount of calibration data to assess the importance of parameters.
Previous research has primarily focused on designing advanced pruning methods,
while the impact of different calibration data on pruning performance still
lacks systematic exploration. We fill this gap and surprisingly observe that
the choice of calibration data matters even more than the choice of pruning
strategy, especially at high sparsity. Our preliminary exploration also
discloses that using calibration data similar to the training data can yield
better performance. As pre-training data is usually inaccessible for advanced
LLMs, we further provide a self-generating calibration data synthesis strategy
to construct feasible calibration data. We conduct experiments on the recent
strong open-source LLMs (e.g., DCLM and LLaMA-3), and the results show that
the proposed method outperforms commonly used calibration data and can
effectively enhance strong pruning methods (e.g., Wanda, OWL).
comment: under review
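The role calibration data plays in post-training pruning is easiest to see in a Wanda-style importance score, where each weight's importance is its magnitude times the norm of the matching input activation computed on calibration samples. The sketch below is a simplification (Wanda itself compares scores within each output row rather than globally), so treat the helper names and the global threshold as illustrative:

```python
import numpy as np

def wanda_scores(W, X_calib):
    """Importance of weight W[i, j]: |W[i, j]| * ||X_calib[:, j]||_2.
    X_calib holds calibration activations (samples x input features), so the
    calibration data directly shapes which weights survive pruning."""
    act_norm = np.linalg.norm(X_calib, axis=0)
    return np.abs(W) * act_norm[np.newaxis, :]

def prune(W, scores, sparsity=0.5):
    """Zero out the lowest-scoring fraction of weights (global threshold,
    simplified relative to Wanda's per-row comparison groups)."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    thresh = np.partition(scores.ravel(), k - 1)[k - 1]
    return np.where(scores <= thresh, 0.0, W)
```

Because `act_norm` comes from the calibration batch, two different calibration sets can keep entirely different weights at the same sparsity, which is the sensitivity the abstract highlights.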
☆ An Adaptive Framework for Generating Systematic Explanatory Answer in Online Q&A Platforms
Question Answering (QA) systems face challenges in handling complex questions
that require multi-domain knowledge synthesis. The naive RAG models, although
effective in information retrieval, struggle with complex questions that
require comprehensive and in-depth answers. We define the pioneering task of
explanatory answer generation, which entails challenges such as the need for
comprehensive information and logical coherence in the generated text. To
address these issues, we draw on systematic thinking theory and propose
SynthRAG, an innovative framework designed to
enhance QA performance. SynthRAG improves on conventional models by employing
adaptive outlines for dynamic content structuring, generating systematic
information to ensure detailed coverage, and producing customized answers
tailored to specific user inquiries. This structured approach guarantees
logical coherence and thorough integration of information, yielding responses
that are both insightful and methodically organized. Empirical evaluations
underscore SynthRAG's effectiveness, demonstrating its superiority in handling
complex questions, overcoming the limitations of naive RAG models, and
significantly improving answer quality and depth. Furthermore, an online
deployment on the Zhihu platform revealed that SynthRAG's answers achieved
notable user engagement, with each response averaging 5.73 upvotes and
surpassing the performance of 79.8% of human contributors, highlighting the
practical relevance and impact of the proposed framework. Our code is available
at https://github.com/czy1999/SynthRAG .
comment: 10 pages, 6 figures
☆ Towards a Similarity-adjusted Surprisal Theory EMNLP 2024
Surprisal theory posits that the cognitive effort required to comprehend a
word is determined by its contextual predictability, quantified as surprisal.
Traditionally, surprisal theory treats words as distinct entities, overlooking
any potential similarity between them. Giulianelli et al. (2023) address this
limitation by introducing information value, a measure of predictability
designed to account for similarities between communicative units. Our work
leverages Ricotta and Szeidl's (2006) diversity index to extend surprisal into
a metric that we term similarity-adjusted surprisal, exposing a mathematical
relationship between surprisal and information value. Similarity-adjusted
surprisal aligns with information value when considering graded similarities
and reduces to standard surprisal when words are treated as distinct.
Experimental results with reading time data indicate that similarity-adjusted
surprisal adds predictive power beyond standard surprisal for certain datasets,
suggesting it serves as a complementary measure of comprehension effort.
comment: EMNLP 2024 main conference proceedings
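One way to read the construction above: standard surprisal is -log p(w|c), and the similarity-adjusted variant replaces the probability of the exact word with similarity-weighted probability mass, -log Σ_w' sim(w, w') p(w'|c). With an identity similarity matrix it reduces to standard surprisal, matching the reduction the abstract states. This is an illustrative formulation; see the paper for the exact diversity-index definition.

```python
import numpy as np

def surprisal(p, i):
    """Standard surprisal of word i under next-word distribution p."""
    return -np.log(p[i])

def similarity_adjusted_surprisal(p, sim, i):
    """sim[i, j] in [0, 1] is the similarity of word i to word j, with
    sim[i, i] = 1. Similar competitors absorb probability mass, lowering
    the measured comprehension effort."""
    return -np.log(np.dot(sim[i], p))
```

When `sim = np.eye(n)` (words treated as distinct) the two functions coincide; when all words are maximally similar, every word becomes fully predictable and the adjusted surprisal drops to zero.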
☆ Quantifying the Risks of Tool-assisted Rephrasing to Linguistic Diversity
Writing assistants and large language models see widespread use in the
creation of text content. While their effectiveness for individual users has
been evaluated in the literature, little is known about their proclivity to
change language or reduce its richness when adopted by a large user base. In
this paper, we take a first step towards quantifying this risk by measuring the
semantic and vocabulary change enacted by the use of rephrasing tools on a
multi-domain corpus of human-generated text.
☆ ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents
Large Language Models (LLMs) have shown promising potential in the medical
domain, assisting with tasks like clinical note generation and patient
communication. However, current LLMs are limited to text-based communication,
hindering their ability to interact with diverse forms of information in
clinical environments. Although clinical agents have succeeded at interacting
with diverse signals, they are typically tailored to a single clinical scenario
and hence fail to generalize to broader applications. To evaluate clinical
agents holistically, we propose
ClinicalAgent Bench~(CAB), a comprehensive medical agent benchmark consisting
of 18 tasks across five key realistic clinical dimensions. Building on this, we
introduce ReflecTool, a novel framework that excels at utilizing
domain-specific tools within two stages. The first optimization stage
progressively enlarges a long-term memory by saving successful solving
processes and tool-wise experience of agents in a tiny pre-defined training
set. In the following inference stage, ReflecTool can search for supportive
successful demonstrations from already built long-term memory to guide the tool
selection strategy, and a verifier improves the tool usage according to the
tool-wise experience with two verification methods--iterative refinement and
candidate selection. Extensive experiments on the ClinicalAgent Benchmark
demonstrate that ReflecTool surpasses pure LLMs by more than 10 points and
well-established agent-based methods by 3 points, highlighting its
adaptability and effectiveness in solving complex clinical tasks.
comment: 20 pages
☆ Markov Chain of Thought for Efficient Mathematical Reasoning
Multi-step Chain of Thought (CoT) benefits from the logical structure of
the reasoning steps and task-specific actions, significantly enhancing the
mathematical reasoning capabilities of large language models. As long CoT
becomes prevalent, however, the number of reasoning steps exceeds manageable
token limits and leads to higher computational demands. Inspired by the
fundamental logic of
human cognition, ``derive, then reduce'', we conceptualize the standard
multi-step CoT as a novel Markov Chain of Thought (MCoT). In this study, we
consider the mathematical reasoning task, defining each reasoning step as text
accompanied by a Python code snippet. To facilitate a longer reasoning path,
self-correction is enabled through interactions with the code interpreter. Our
MCoT aims to compress previous reasoning steps into a simplified question,
enabling efficient next-step inference without relying on a lengthy KV cache.
In our experiments, we curate the \texttt{MCoTInstruct} dataset, and the
empirical results indicate that MCoT not only significantly enhances efficiency
but also maintains comparable accuracy. While much remains to be explored, this
work paves the way for exploring the long CoT reasoning abilities of LLMs.
comment: Work in progress
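The "derive, then reduce" loop can be sketched generically: each inference step sees only the current compressed question, never the full history, which is why no lengthy KV cache is needed. This is a toy scaffold with hypothetical callback names, not the paper's implementation:

```python
def markov_cot(question, derive, reduce, is_final, max_steps=20):
    """Markov Chain of Thought sketch: the state is a self-contained
    (simplified) question, so context size never grows with chain length."""
    state = question
    for _ in range(max_steps):
        step = derive(state)        # one reasoning step (in the paper: text + Python snippet)
        if is_final(step):
            return step
        state = reduce(state, step)  # fold the step into a new, simpler question
    return state
```

As a degenerate usage example, counting down to zero: `derive` subtracts one, `reduce` makes the latest result the new "question", and the loop stops at zero.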
☆ LMLPA: Language Model Linguistic Personality Assessment
Large Language Models (LLMs) are increasingly used in everyday life and
research. One of the most common use cases is conversational interactions,
enabled by the language generation capabilities of LLMs. Just as between two
humans, a conversation between an LLM-powered entity and a human depends on the
personality of the conversants. However, measuring the personality of a given
LLM is currently a challenge. This paper introduces the Language Model
Linguistic Personality Assessment (LMLPA), a system designed to evaluate the
linguistic personalities of LLMs. Our system helps to understand LLMs' language
generation capabilities by quantitatively assessing the distinct personality
traits reflected in their linguistic outputs. Unlike traditional human-centric
psychometrics, the LMLPA adapts a personality assessment questionnaire,
specifically the Big Five Inventory, to align with the operational capabilities
of LLMs, and also incorporates the findings from previous language-based
personality measurement literature. To mitigate sensitivity to the order of
options, our questionnaire is designed to be open-ended, resulting in textual
answers. Thus, an AI rater is needed to transform ambiguous personality
information from text responses into clear numerical indicators of personality
traits. Utilising Principal Component Analysis and reliability validations, our
findings demonstrate that LLMs possess distinct personality traits that can be
effectively quantified by the LMLPA. This research contributes to
Human-Computer Interaction and Human-Centered AI, providing a robust framework
for future studies to refine AI personality assessments and expand their
applications in multiple areas, including education and manufacturing.
☆ Graphusion: A RAG Framework for Knowledge Graph Construction with a Global Perspective
Rui Yang, Boming Yang, Aosong Feng, Sixun Ouyang, Moritz Blum, Tianwei She, Yuang Jiang, Freddy Lecue, Jinghui Lu, Irene Li
Knowledge Graphs (KGs) are crucial in the field of artificial intelligence
and are widely used in downstream tasks, such as question-answering (QA). The
construction of KGs typically requires significant effort from domain experts.
Large Language Models (LLMs) have recently been used for Knowledge Graph
Construction (KGC). However, most existing approaches focus on a local
perspective, extracting knowledge triplets from individual sentences or
documents, missing a fusion process to combine the knowledge in a global KG.
This work introduces Graphusion, a zero-shot KGC framework from free text. It
contains three steps: in Step 1, we extract a list of seed entities using topic
modeling to ensure that the final KG includes the most relevant entities; in
Step 2,
we conduct candidate triplet extraction using LLMs; in Step 3, we design a
novel fusion module that provides a global view of the extracted knowledge,
incorporating entity merging, conflict resolution, and novel triplet discovery.
Results show that Graphusion achieves scores of 2.92 and 2.37 out of 3 for
entity extraction and relation recognition, respectively. Moreover, we showcase
how Graphusion could be applied to the Natural Language Processing (NLP) domain
and validate it in an educational scenario. Specifically, we introduce TutorQA,
a new expert-verified benchmark for QA, comprising six tasks and a total of
1,200 QA pairs. Using the Graphusion-constructed KG, we achieve a significant
improvement on the benchmark, for example, a 9.2% accuracy improvement on
sub-graph completion.
comment: arXiv admin note: substantial text overlap with arXiv:2407.10794
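The fusion step (Step 3) is the global-view contribution, and its entity-merging plus conflict-resolution core can be sketched as below. This is an assumed simplification (alias-based merging and majority-vote relation resolution), not Graphusion's exact module:

```python
from collections import Counter, defaultdict

def fuse_triplets(local_triplets, alias_map=None):
    """Fusion sketch: merge entity aliases into canonical names, then resolve
    conflicting relations for the same (head, tail) pair by majority vote
    across locally extracted triplets."""
    alias_map = alias_map or {}

    def canon(entity):
        return alias_map.get(entity, entity)

    votes = defaultdict(Counter)
    for head, rel, tail in local_triplets:
        votes[(canon(head), canon(tail))][rel] += 1
    # Keep the most frequently asserted relation per entity pair.
    return [(h, votes[(h, t)].most_common(1)[0][0], t) for (h, t) in votes]
```

Novel-triplet discovery, the third fusion component the abstract mentions, would add candidate edges beyond those locally extracted and is omitted here.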
☆ Cross-model Control: Improving Multiple Large Language Models in One-time Training NeurIPS 2024
The number of large language models (LLMs) with varying parameter scales and
vocabularies is increasing. While they deliver powerful performance, they also
face a set of common optimization needs to meet specific requirements or
standards, such as instruction following or avoiding the output of sensitive
information from the real world. However, how to reuse the fine-tuning outcomes
of one model to other models to reduce training costs remains a challenge. To
bridge this gap, we introduce Cross-model Control (CMC), a method that improves
multiple LLMs in one-time training with a portable tiny language model.
Specifically, we have observed that the logit shift before and after
fine-tuning is remarkably similar across different models. Based on this
insight, we incorporate a tiny language model with a minimal number of
parameters. By training alongside a frozen template LLM, the tiny model gains
the capability to alter the logits output by the LLMs. To make this tiny
language model applicable to models with different vocabularies, we propose a
novel token mapping strategy named PM-MinED. We have conducted extensive
experiments on instruction tuning and unlearning tasks, demonstrating the
effectiveness of CMC. Our code is available at https://github.com/wujwyi/CMC.
comment: Accepted by NeurIPS 2024
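The core observation, that the fine-tuning-induced logit shift is similar across models, suggests a decode-time sketch like the following. This is illustrative only: CMC's actual tiny model is trained alongside a frozen template LLM, and the PM-MinED token mapping for mismatched vocabularies is not shown.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a logit vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def cmc_decode_step(frozen_llm_logits, tiny_model_shift):
    """The portable tiny model predicts the logit shift that fine-tuning
    would have induced; adding it steers any frozen base LLM's next-token
    distribution without retraining that LLM."""
    return softmax(frozen_llm_logits + tiny_model_shift)
```

Because the shift is additive in logit space, the same tiny model can (after token mapping) be reused across multiple base LLMs, which is the one-time-training benefit the abstract describes.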
☆ MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
Large language models (LLMs) are commonly used as evaluators in tasks (e.g.,
reward modeling, LLM-as-a-judge), where they act as proxies for human
preferences or judgments. This leads to the need for meta-evaluation:
evaluating the credibility of LLMs as evaluators. However, existing benchmarks
primarily focus on English, offering limited insight into LLMs' effectiveness
as evaluators in non-English contexts. To address this, we introduce MM-Eval, a
multilingual meta-evaluation benchmark that covers 18 languages across six
categories. MM-Eval evaluates various dimensions, including language-specific
challenges like linguistics and language hallucinations. Evaluation results
show that both proprietary and open-source language models have considerable
room for improvement. Further analysis reveals a tendency for these models to
assign middle-ground scores to low-resource languages. We publicly release our
benchmark and code.
comment: work in progress
☆ Differentially Private Learning Needs Better Model Initialization and Self-Distillation
Differentially private SGD (DPSGD) enables privacy-preserving training of
language models, but often reduces utility, diversity, and linguistic quality.
We introduce DPRefine, a three-phase method that initializes a model using data
synthesis from a small pre-trained LM with rigorous filtering, applies DP
finetuning on private data, and performs self-distillation to refine outputs.
This approach significantly outperforms vanilla DPSGD, with AlpacaEval
preferring DPRefine's generations in 78.4% of cases across all datasets. Our
analysis reveals that DPRefine reduces linguistic errors in generated text by
84.0%, mitigating grammar and spelling errors commonly associated with DPSGD.
It also reduces inconsistencies of non-private models, such as hallucinated
details and misattributed quotes. We find that small models like GPT-2 can be
effective for initialization and distillation, highlighting their potential in
enabling scalable and efficient deployment of privacy-preserving language
models.
comment: 18 pages
☆ ESpeW: Robust Copyright Protection for LLM-based EaaS via Embedding-Specific Watermark
Embeddings as a Service (EaaS) is emerging as a crucial role in AI
applications. Unfortunately, EaaS is vulnerable to model extraction attacks,
highlighting the urgent need for copyright protection. Although some preliminary
works propose applying embedding watermarks to protect EaaS, recent research
reveals that these watermarks can be easily removed. Hence, it is crucial to
inject robust watermarks resistant to watermark removal attacks. Existing
watermarking methods typically inject a target embedding into embeddings
through linear interpolation when the text contains triggers. However, this
mechanism results in each watermarked embedding having the same component,
which makes the watermark easy to identify and eliminate. Motivated by this, in
this paper, we propose a novel embedding-specific watermarking (ESpeW)
mechanism to offer robust copyright protection for EaaS. Our approach involves
injecting unique, yet readily identifiable watermarks into each embedding.
Watermarks inserted by ESpeW are designed to maintain a significant distance
from one another and to avoid sharing common components, thus making it
significantly more challenging to remove the watermarks. Extensive experiments
on four popular datasets demonstrate that ESpeW can even watermark successfully
against a highly aggressive removal strategy without sacrificing the quality of
embeddings.
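The property that watermarked embeddings share no common component can be illustrated with a keyed, text-specific dimension selection. This sketch is speculative (the function name, hashing scheme, and interpolation are all assumptions, not ESpeW's actual mechanism), but it shows why per-embedding watermarks resist the averaging attacks that defeat a shared target embedding:

```python
import hashlib
import numpy as np

def espew_style_watermark(embedding, text, secret, frac=0.05, strength=0.5):
    """Illustrative sketch only: derive a text-specific dimension subset from a
    keyed hash, so each watermarked embedding is perturbed in different
    dimensions and no common component exists to identify or strip."""
    d = embedding.size
    seed = int.from_bytes(hashlib.sha256((secret + text).encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    idx = rng.choice(d, size=max(1, int(frac * d)), replace=False)
    marked = embedding.copy()
    # Blend watermark values into only the selected dimensions.
    marked[idx] = (1 - strength) * marked[idx] + strength * rng.standard_normal(idx.size)
    return marked, idx
```

The provider, who knows `secret`, can regenerate `idx` for any text to verify the watermark; an attacker without the key sees no shared direction across embeddings.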
☆ ProtoLens: Advancing Prototype Learning for Fine-Grained Interpretability in Text Classification
Deep neural networks have achieved remarkable performance in various
text-based tasks but often lack interpretability, making them less suitable for
applications where transparency is critical. To address this, we propose
ProtoLens, a novel prototype-based model that provides fine-grained,
sub-sentence level interpretability for text classification. ProtoLens uses a
Prototype-aware Span Extraction module to identify relevant text spans
associated with learned prototypes and a Prototype Alignment mechanism to
ensure prototypes are semantically meaningful throughout training. By aligning
the prototype embeddings with human-understandable examples, ProtoLens provides
interpretable predictions while maintaining competitive accuracy. Extensive
experiments demonstrate that ProtoLens outperforms both prototype-based and
non-interpretable baselines on multiple text classification benchmarks. Code
and data are available at
\url{https://anonymous.4open.science/r/ProtoLens-CE0B/}.
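A generic prototype-classifier core conveys the interpretability mechanism: each class score is the best match between some extracted span embedding and that class's prototype, and the best-matching span doubles as the explanation. This is an assumed structure for illustration, not ProtoLens's exact span-extraction or alignment modules:

```python
import numpy as np

def prototype_scores(span_embeddings, prototypes):
    """Score each class by the highest cosine similarity between any text-span
    embedding and the class prototype vector; the argmax span is the
    human-readable evidence for the prediction."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.array([max(cos(s, p) for s in span_embeddings) for p in prototypes])
```

Training would additionally pull each prototype toward real, human-understandable example spans, which is the role of the Prototype Alignment mechanism described above.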
☆ Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact
Multilingual Large Language Models (MLLMs) represent a pivotal advancement in
democratizing artificial intelligence across linguistic boundaries. While
theoretical foundations are well-established, practical implementation
guidelines remain scattered. This work bridges this gap by providing a
comprehensive end-to-end framework for developing and deploying MLLMs in
production environments. We make three distinctive contributions: First, we
present an actionable pipeline from data pre-processing through deployment,
integrating insights from academic research and industrial applications.
Second, using Llama2 as a case study, we provide detailed optimization
strategies for enhancing multilingual capabilities, including curriculum
learning approaches for balancing high-resource and low-resource languages,
tokenization strategies, and effective sampling methods. Third, we offer an
interdisciplinary analysis that considers technical, linguistic, and cultural
perspectives in MLLM development. Our findings reveal critical challenges in
supporting linguistic diversity, with 88.38% of world languages categorized as
low-resource, affecting over a billion speakers. We examine practical solutions
through real-world applications in customer service, search engines, and
machine translation. By synthesizing theoretical frameworks with
production-ready implementation strategies, this survey provides essential
guidance for practitioners and researchers working to develop more inclusive
and effective multilingual AI systems.
☆ Navigate Complex Physical Worlds via Geometrically Constrained LLM
This study investigates the potential of Large Language Models (LLMs) for
reconstructing and constructing the physical world solely based on textual
knowledge. It explores the impact of model performance on spatial understanding
abilities. To enhance the comprehension of geometric and spatial relationships
in the complex physical world, the study introduces a set of geometric
conventions and develops a workflow based on multi-layer graphs and multi-agent
system frameworks. It examines how LLMs achieve multi-step and multi-objective
geometric inference in a spatial environment using multi-layer graphs under
unified geometric conventions. Additionally, the study employs a genetic
algorithm, inspired by large-scale model knowledge, to solve geometric
constraint problems. In summary, this work innovatively explores the
feasibility of using text-based LLMs as physical world builders and designs a
workflow to enhance their capabilities.
☆ MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
Autonomous agents powered by large language models (LLMs) show promising
potential in assistive tasks across various domains, including mobile device
control. As these agents interact directly with personal information and device
settings, ensuring their safe and reliable behavior is crucial to prevent
undesirable outcomes. However, no benchmark exists for standardized evaluation
of the safety of mobile device-control agents. In this work, we introduce
MobileSafetyBench, a benchmark designed to evaluate the safety of
device-control agents within a realistic mobile environment based on Android
emulators. We develop a diverse set of tasks involving interactions with
various mobile applications, including messaging and banking applications. To
clearly evaluate safety apart from general capabilities, we design separate
tasks measuring safety and tasks evaluating helpfulness. The safety tasks
challenge agents with managing potential risks prevalent in daily life and
include tests to evaluate robustness against indirect prompt injections. Our
experiments demonstrate that while baseline agents, based on state-of-the-art
LLMs, perform well in executing helpful tasks, they show poor performance in
safety tasks. To mitigate these safety concerns, we propose a prompting method
that encourages agents to prioritize safety considerations. While this method
shows promise in promoting safer behaviors, there is still considerable room
for improvement to fully earn user trust. This highlights the urgent need for
continued research to develop more robust safety mechanisms in mobile
environments. We open-source our benchmark at:
https://mobilesafetybench.github.io/.
☆ Large Language Models Still Exhibit Bias in Long Text
Existing fairness benchmarks for large language models (LLMs) primarily focus
on simple tasks, such as multiple-choice questions, overlooking biases that may
arise in more complex scenarios like long-text generation. To address this gap,
we introduce the Long Text Fairness Test (LTF-TEST), a framework that evaluates
biases in LLMs through essay-style prompts. LTF-TEST covers 14 topics and 10
demographic axes, including gender and race, resulting in 11,948 samples. By
assessing both model responses and the reasoning behind them, LTF-TEST uncovers
subtle biases that are difficult to detect in simple responses. In our
evaluation of five recent LLMs, including GPT-4o and LLaMa3, we identify two
key patterns of bias. First, these models frequently favor certain demographic
groups in their responses. Second, they show excessive sensitivity toward
traditionally disadvantaged groups, often providing overly protective responses
while neglecting others. To mitigate these biases, we propose FT-REGARD, a
finetuning approach that pairs biased prompts with neutral responses. FT-REGARD
reduces gender bias by 34.6% and improves performance by 1.4 percentage points
on the BBQ benchmark, offering a promising approach to addressing biases in
long-text generation tasks.
comment: 22 pages, 38 figures, NeurIPS (SoLaR Workshop)
☆ Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks
Large Language Models (LLMs) have demonstrated impressive abilities in symbol
processing through in-context learning (ICL). This success flies in the face of
decades of predictions that artificial neural networks cannot master abstract
symbol manipulation. We seek to understand the mechanisms that can enable
robust symbol processing in transformer networks, illuminating both the
unanticipated success, and the significant limitations, of transformers in
symbol processing. Borrowing insights from symbolic AI on the power of
Production System architectures, we develop a high-level language, PSL, that
allows us to write symbolic programs to do complex, abstract symbol processing,
and create compilers that precisely implement PSL programs in transformer
networks which are, by construction, 100% mechanistically interpretable. We
demonstrate that PSL is Turing Universal, so the work can inform the
understanding of transformer ICL in general. The type of transformer
architecture that we compile from PSL programs suggests a number of paths for
enhancing transformers' capabilities at symbol processing. (Note: The first
section of the paper gives an extended synopsis of the entire paper.)
comment: 101 pages (including 30 pages of Appendices), 18 figures
☆ BadFair: Backdoored Fairness Attacks with Group-conditioned Triggers EMNLP 2024
Attacking fairness is crucial because compromised models can introduce biased
outcomes, undermining trust and amplifying inequalities in sensitive
applications like hiring, healthcare, and law enforcement. This highlights the
urgent need to understand how fairness mechanisms can be exploited and to
develop defenses that ensure both fairness and robustness. We introduce
BadFair, a novel backdoored fairness attack methodology. BadFair stealthily
crafts a model that operates with accuracy and fairness under regular
conditions but, when activated by certain triggers, discriminates and produces
incorrect results for specific groups. This type of attack is particularly
stealthy and dangerous, as it circumvents existing fairness detection methods,
maintaining an appearance of fairness in normal use. Our findings reveal that
BadFair achieves a more than 85% attack success rate in attacks aimed at target
groups on average while only incurring a minimal accuracy loss. Moreover, it
consistently exhibits a significant discrimination score, distinguishing
between pre-defined target and non-target attacked groups across various
datasets and models.
comment: Accepted by EMNLP 2024
☆ VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg
Recent studies have augmented large language models (LLMs) with speech
capabilities, leading to the development of speech language models (SpeechLMs).
Earlier SpeechLMs focused on single-turn speech-based question answering (QA),
where user input comprised a speech context and a text question. More recent
studies have extended this to multi-turn conversations, though they often
require complex, multi-stage supervised fine-tuning (SFT) with diverse data.
Another critical challenge with SpeechLMs is catastrophic forgetting, where
models optimized for speech tasks suffer significant degradation in text-only
performance. To mitigate these issues, we propose a novel single-stage joint
speech-text SFT approach on the low-rank adaptation (LoRA) of the LLM backbone.
Our joint SFT combines text-only SFT data with three types of speech-related
data: speech recognition and translation, speech-based QA, and mixed-modal SFT.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model
demonstrates superior performance across various speech benchmarks while
preserving the original capabilities on text-only tasks. Furthermore, our model
shows the emergent ability to handle previously unseen prompts and tasks,
including multi-turn, mixed-modal inputs.
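The LoRA adaptation of the LLM backbone mentioned above can be sketched generically: a frozen weight matrix is augmented with a low-rank update B·A, and only A and B are trained during joint SFT. The NumPy sketch below illustrates the mechanism only; the shapes, initialization, and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

d, r = 16, 2                          # hidden size and LoRA rank (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))           # frozen backbone weight
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # up-projection starts at zero, so the
                                      # adapter is a no-op before training

def lora_forward(x):
    """Frozen path plus low-rank update: x W^T + x A^T B^T."""
    return x @ W.T + x @ A.T @ B.T

x = rng.normal(size=(4, d))
# At initialization the adapted model matches the frozen backbone exactly,
# which is what makes LoRA a safe, cheap way to add speech capabilities.
assert np.allclose(lora_forward(x), x @ W.T)
```

Because only A and B receive gradients, the text-only behavior of the frozen backbone is easier to preserve, which is the motivation for using LoRA against catastrophic forgetting.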
☆ Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering
Conventional medical artificial intelligence (AI) models face barriers in
clinical application and ethical issues owing to their inability to handle the
privacy-sensitive characteristics of medical data. We present a novel
personalized federated learning (pFL) method for medical visual question
answering (VQA) models, addressing privacy and reliability challenges in the
medical domain. Our method introduces learnable prompts into a Transformer
architecture to efficiently train it on diverse medical datasets without
massive computational costs. Then we introduce a reliable client VQA model that
incorporates Dempster-Shafer evidence theory to quantify uncertainty in
predictions, enhancing the model's reliability. Furthermore, we propose a novel
inter-client communication mechanism that uses maximum likelihood estimation to
balance accuracy and uncertainty, fostering efficient integration of insights
across clients.
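Dempster-Shafer-style uncertainty is commonly realized via an evidential formulation: model outputs are mapped to non-negative evidence parameterizing a Dirichlet distribution, and the leftover "vacuity" mass serves as the uncertainty score. The sketch below shows that common construction for illustration; it is an assumed formulation, not necessarily the paper's exact model.

```python
import numpy as np

def dirichlet_uncertainty(logits):
    """Subjective-logic style uncertainty: non-negative evidence defines a
    Dirichlet over class probabilities; vacuity u = K / S shrinks as the
    total evidence S grows."""
    evidence = np.log1p(np.exp(logits))   # softplus keeps evidence >= 0
    alpha = evidence + 1.0                # Dirichlet concentration parameters
    S = alpha.sum()
    K = len(alpha)
    belief = evidence / S                 # per-class belief masses
    u = K / S                             # uncertainty (vacuity) mass
    return belief, u

# Strong evidence for one class -> low uncertainty
_, u_confident = dirichlet_uncertainty(np.array([8.0, -4.0, -4.0]))
# No evidence at all -> uncertainty mass is maximal (u approaches 1)
_, u_vacuous = dirichlet_uncertainty(np.array([-50.0, -50.0, -50.0]))
```

A per-client uncertainty like `u` is what lets a federated server weight reliable clients more heavily when aggregating.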
☆ Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don't mimic the full human distribution
Inferences from adjective-noun combinations like "Is artificial intelligence
still intelligence?" provide a good test bed for LLMs' understanding of meaning
and compositional generalization capability, since there are many combinations
which are novel to both humans and LLMs but nevertheless elicit convergent
human judgments. We study a range of LLMs and find that the largest models we
tested are able to draw human-like inferences when the inference is determined
by context and can generalize to unseen adjective-noun combinations. We also
propose three methods to evaluate LLMs on these inferences out of context,
where there is a distribution of human-like answers rather than a single
correct answer. We find that LLMs show a human-like distribution on at most
75% of our dataset, which is promising but still leaves room for improvement.
comment: 9 pages (23 pages with appendix). Accepted to GenBench 2024
♻ ☆ MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation NAACL 2025
Long-term memory is important for chatbots and dialogue systems (DS) to
create consistent and human-like conversations, as evidenced by the numerous
memory-augmented DS (MADS) developed. To evaluate the effectiveness of such MADS,
existing commonly used evaluation metrics, like retrieval accuracy and
perplexity (PPL), mainly focus on query-oriented factualness and language
quality assessment. However, these metrics often lack practical value.
Moreover, the evaluation dimensions are insufficient for human-like assessment
in DS. Regarding memory-recalling paradigms, current evaluation schemes only
consider passive memory retrieval while ignoring diverse memory recall with
rich triggering factors, e.g., emotions and surroundings, which can be
essential in emotional support scenarios. To bridge the gap, we construct a
novel Memory-Augmented Dialogue Benchmark (MADail-Bench) covering various
memory-recalling paradigms based on cognitive science and psychology theories.
The benchmark assesses two tasks separately: memory retrieval and memory
recognition with the incorporation of both passive and proactive memory recall
data. We introduce new scoring criteria to the evaluation, including memory
injection, emotion support (ES) proficiency, and intimacy, to comprehensively
assess generated responses. Results from cutting-edge embedding models and
large language models on this benchmark indicate the potential for further
advancement. Extensive testing further reveals correlations between memory
injection, ES proficiency, and intimacy.
comment: Submitted to NAACL 2025
♻ ☆ Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs
Nigeria is a multilingual country with 500+ languages. Naija is a
Nigerian-Pidgin spoken by approximately 120M speakers in Nigeria and a mixed
language drawing on English, Portuguese, Yoruba, Hausa, and Igbo. Although it has
mainly been a spoken language until recently, there are now various platforms
publishing exclusively in Naija, such as the Naija Wikipedia. However, it is
hard for non-native speakers to distinguish Naija from a larger pidgin language
spoken across West Africa, known as West African Pidgin English (WAPE), which
is more simplified and understandable by a wider audience in Ghana, Nigeria,
and Cameroon. The BBC news
platform publishes exclusively in WAPE to cater for several countries in West
Africa. In our paper, we show through statistical analyses and Machine
Translation experiments that these two creole varieties do not represent each
other (i.e., there are linguistic differences in word order and vocabulary) and
that Generative AI operates based only on WAPE. In other words, Naija is
under-represented in Generative AI, and it is hard to teach LLMs with few
examples.
comment: under review
♻ ☆ Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning EMNLP 2024
Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ramé, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Léonard Hussenot, Olivier Bachem, Edouard Leurent
Reward-based finetuning is crucial for aligning language policies with
intended behaviors (e.g., creativity and safety). A key challenge is to develop
steerable language models that trade-off multiple (conflicting) objectives in a
flexible and efficient manner. This paper presents Conditional Language Policy
(CLP), a general framework for finetuning language models on multiple
objectives. Building on techniques from multi-task training and
parameter-efficient finetuning, CLP learns steerable models that effectively
trade-off conflicting objectives at inference time. Notably, this does not
require training or maintaining multiple models to achieve different trade-offs
between the objectives. Through extensive experiments and ablations on two
summarization datasets, we show that CLP learns steerable language models that
outperform and Pareto-dominate the existing approaches for multi-objective
finetuning.
comment: 40 pages. Findings of EMNLP 2024
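The trade-off CLP steers can be grounded in the simplest multi-objective device: a preference weight vector that scalarizes conflicting rewards. The sketch below shows only that basic scalarization idea, as an assumed illustration; CLP itself conditions the policy on the weighting rather than retraining per weight.

```python
import numpy as np

def scalarize(rewards, w):
    """Combine rewards for conflicting objectives with a preference weight
    vector w (nonnegative, summing to 1). Conditioning a single policy on w
    at inference time is the steerability idea; this linear combination is
    the simplest illustrative rule, not CLP's training procedure."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return float(np.dot(np.asarray(rewards, dtype=float), w))

# One response scored on two conflicting objectives (say, creativity vs safety)
r = [0.9, 0.2]
only_first = scalarize(r, [1.0, 0.0])   # care only about the first objective
balanced = scalarize(r, [0.5, 0.5])     # equal trade-off between the two
```

Sweeping `w` traces out the trade-off curve that a steerable model should Pareto-dominate.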
♻ ☆ STAR: SocioTechnical Approach to Red Teaming Language Models
Laura Weidinger, John Mellor, Bernat Guillen Pegueroles, Nahema Marchal, Ravin Kumar, Kristian Lum, Canfer Akbulut, Mark Diaz, Stevie Bergman, Mikel Rodriguez, Verena Rieser, William Isaac
This research introduces STAR, a sociotechnical framework that improves on
current best practices for red teaming the safety of large language models.
STAR makes two key contributions. First, it enhances steerability by generating
parameterised instructions for human red teamers, leading to improved coverage
of the risk surface; parameterised instructions also provide more detailed
insights into model failures at no increased cost. Second, STAR improves signal
quality by matching demographics to assess harms for specific groups, resulting
in more sensitive annotations. STAR further employs a novel step of arbitration
to leverage diverse viewpoints and improve label reliability, treating
disagreement not as noise but as a valuable contribution to signal quality.
comment: 8 pages, 5 figures, 5 pages appendix. * denotes equal contribution
♻ ☆ Proof of Thought: Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning NeurIPS 2024
Large Language Models (LLMs) have revolutionized natural language processing,
yet they struggle with inconsistent reasoning, particularly in novel domains
and complex logical sequences. This research introduces Proof of Thought, a
framework that enhances the reliability and transparency of LLM outputs. Our
approach bridges LLM-generated ideas with formal logic verification, employing
a custom interpreter to convert LLM outputs into First Order Logic constructs
for theorem prover scrutiny. Central to our method is an intermediary
JSON-based Domain-Specific Language, which by design balances precise logical
structures with intuitive human concepts. This hybrid representation enables
both rigorous validation and accessible human comprehension of LLM reasoning
processes. Key contributions include a robust type system with sort management
for enhanced logical integrity, explicit representation of rules for clear
distinction between factual and inferential knowledge, and a flexible
architecture that allows for easy extension to various domain-specific
applications. We demonstrate Proof of Thought's effectiveness through
benchmarking on StrategyQA and a novel multimodal reasoning task, showing
improved performance in open-ended scenarios. By providing verifiable and
interpretable results, our technique addresses critical needs for AI system
accountability and sets a foundation for human-in-the-loop oversight in
high-stakes domains.
comment: 38th Conference on Neural Information Processing Systems (NeurIPS
2024) System 2 Reasoning At Scale Workshop
♻ ☆ AlleNoise: large-scale text classification benchmark dataset with real-world label noise
Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko
Label noise remains a challenge for training robust classification models.
Most methods for mitigating label noise have been benchmarked primarily using
datasets with synthetic noise. While the need for datasets with realistic noise
distribution has partially been addressed by web-scraped benchmarks such as
WebVision and Clothing1M, those benchmarks are restricted to the computer
vision domain. With the growing importance of Transformer-based models, it is
crucial to establish text classification benchmarks for learning with noisy
labels. In this paper, we present AlleNoise, a new curated text classification
benchmark dataset with real-world instance-dependent label noise, containing
over 500,000 examples across approximately 5,600 classes, complemented with a
meaningful, hierarchical taxonomy of categories. The noise distribution comes
from actual users of a major e-commerce marketplace, so it realistically
reflects the semantics of human mistakes. In addition to the noisy labels, we
provide human-verified clean labels, which help to get a deeper insight into
the noise distribution, unlike web-scraped datasets typically used in the
field. We demonstrate that a representative selection of established methods
for learning with noisy labels is inadequate to handle such real-world noise.
In addition, we show evidence that these algorithms do not alleviate excessive
memorization. As such, with AlleNoise, we set the bar high for the development
of label noise methods that can handle real-world label noise in text
classification tasks. The code and dataset are available for download at
https://github.com/allegro/AlleNoise.
♻ ☆ Annotator-Centric Active Learning for Subjective NLP Tasks EMNLP2024
Active Learning (AL) addresses the high costs of collecting human annotations
by strategically annotating the most informative samples. However, for
subjective NLP tasks, incorporating a wide range of perspectives in the
annotation process is crucial to capture the variability in human judgments. We
introduce Annotator-Centric Active Learning (ACAL), which incorporates an
annotator selection strategy following data sampling. Our objective is
two-fold: 1) to efficiently approximate the full diversity of human judgments,
and 2) to assess model performance using annotator-centric metrics, which value
minority and majority perspectives equally. We experiment with multiple
annotator selection strategies across seven subjective NLP tasks, employing
both traditional and novel, human-centered evaluation metrics. Our findings
indicate that ACAL improves data efficiency and excels in annotator-centric
performance evaluations. However, its success depends on the availability of a
sufficiently large and diverse pool of annotators to sample from.
comment: Accepted at EMNLP2024
♻ ☆ Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration
Despite their proficiency in math tasks, the mechanisms underlying LLMs'
mathematical reasoning abilities remain a subject of debate. Recent studies
suggest that chain-of-thought (CoT) prompts can bolster mathematical reasoning
by encouraging LLMs to employ human-like logical reasoning (System 2), enabling
them to excel on the Cognitive Reflection Test (CRT). To assess whether LLMs
genuinely possess System 2-like logical reasoning, we introduced targeted
modifications to CRT problems. Our findings reveal that, despite the use of CoT
prompts, mainstream LLMs, including the latest o1-preview model, continue to
exhibit a significant error rate. Further analysis indicates that they
predominantly rely on System 1-like intuitive reasoning and pattern matching
derived from training data, rather than demonstrating mastery of mathematical
thinking. This discovery challenges the prevailing notion that LLMs possess
genuine logical reasoning abilities and that CoT can enhance them.
Consequently, this work may temper overly optimistic projections regarding
LLMs' advancement toward artificial general intelligence.
♻ ☆ Linear Adversarial Concept Erasure ICML 2022
Modern neural models trained on textual data rely on pre-trained
representations that emerge without direct supervision. As these
representations are increasingly being used in real-world applications, the
inability to control their content becomes an increasingly important
problem. We formulate the problem of identifying and erasing a linear subspace
that corresponds to a given concept, in order to prevent linear predictors from
recovering the concept. We model this problem as a constrained, linear maximin
game, and show that existing solutions are generally not optimal for this task.
We derive a closed-form solution for certain objectives, and propose a convex
relaxation that works well for others. When evaluated in the context
of binary gender removal, the method recovers a low-dimensional subspace whose
removal mitigates bias by intrinsic and extrinsic evaluation. We show that the
method is highly expressive, effectively mitigating bias in deep nonlinear
classifiers while maintaining tractability and interpretability.
comment: Accepted in ICML 2022; a revised version
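The erasure goal stated above, preventing any linear predictor from recovering a concept, can be illustrated with the most basic instrument: orthogonal projection off the concept subspace. The sketch below shows that baseline operation only; the paper's contribution is the maximin game and its relaxation for finding the subspace, which this does not implement.

```python
import numpy as np

def erase_subspace(X, W):
    """Project representations X (n x d) onto the orthogonal complement of
    the subspace spanned by the columns of W (d x k), so no linear predictor
    can recover the directions in W from the result."""
    Q, _ = np.linalg.qr(W)                 # orthonormal basis for the subspace
    P = np.eye(X.shape[1]) - Q @ Q.T       # projection off that subspace
    return X @ P

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
w = rng.normal(size=(16, 1))               # a single concept direction, say
X_clean = erase_subspace(X, w)
# After erasure, every representation is orthogonal to the erased direction,
# so a linear probe along w carries no signal.
```

In the paper's setting the subspace is chosen adversarially against the best linear concept predictor, rather than given in advance as here.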
♻ ☆ Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding
Large Language Models (LLMs) exhibit impressive capabilities across various
applications but encounter substantial challenges such as high inference
latency, considerable training costs, and the generation of hallucinations.
Collaborative decoding between large and small language models (SLMs) presents
a promising strategy to mitigate these issues through methods including
speculative decoding, contrastive decoding, and emulator or proxy fine-tuning.
However, the specifics of such collaborations, particularly from a unified
perspective, remain largely unexplored. Inspired by dual-process cognitive
theory, we propose a unified framework in this paper, termed Fast and Slow
Generating (FS-GEN). Within this framework, LLMs (sometimes along with SLMs)
are categorized as System 2 (slow and deliberate), while independent SLMs are
designated as System 1 (fast and intuitive). We provide a comprehensive
analysis of these collaborative methodologies, elucidating their common
properties and shedding light on the differential knowledge capabilities of
System 2 versus System 1 through the FS-GEN framework. Our findings indicate
that only a small proportion of collaborative interactions (less than 20% in
most instances) is necessary across various methods. These
interactions between System 1 and System 2 conform to a scaling law related to
the parameter ratios, enabling predictable collaboration. Furthermore, we
explore the specific conditions under which collaboration proves most
effective, particularly from an uncertainty perspective, offering novel
insights that may guide future optimization efforts. Our research underscores
that the fundamental distinction between System 1 and System 2 lies in the
uncertainty of next token predictions, where interventions by System 2 are
crucial to support System 1. Code for Reproduction:
https://github.com/TsinghuaC3I/FS-GEN
comment: update figures and results on Pythia Series
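The System 1 / System 2 interaction rate that FS-GEN measures can be made concrete with a toy draft-and-verify loop: the fast model proposes each next token and we count how often the slow model would override it. This is a deliberately simplified illustration; it queries both models at every step, whereas real collaborative schemes such as speculative decoding batch the slow model's checks, and the "models" here are stand-in callables.

```python
def collaborative_decode(fast, slow, prompt, steps):
    """Toy loop in the spirit of FS-GEN: System 1 (`fast`) drafts each next
    token; System 2 (`slow`) intervenes only when it disagrees. Returns the
    generated sequence and the fraction of steps where the slow model had
    to change the draft."""
    seq, interventions = list(prompt), 0
    for _ in range(steps):
        draft = fast(seq)
        final = slow(seq)
        if final != draft:
            interventions += 1
        seq.append(final)
    return seq, interventions / steps

# Stand-in models: the fast one always guesses 0, the slow one alternates
# based on current sequence length, so it must intervene on half the steps.
fast = lambda seq: 0
slow = lambda seq: len(seq) % 2
seq, rate = collaborative_decode(fast, slow, [1, 0, 1], steps=8)
```

The intervention fraction `rate` is the quantity the paper finds to be small in practice, which is what makes collaboration cheap.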
♻ ☆ LocoMotion: Learning Motion-Focused Video-Language Representations ACCV 2024
This paper strives for motion-focused video-language representations.
Existing methods to learn video-language representations use spatial-focused
data, where identifying the objects and scene is often enough to distinguish
the relevant caption. We instead propose LocoMotion to learn from
motion-focused captions that describe the movement and temporal progression of
local object motions. We achieve this by adding synthetic motions to videos and
using the parameters of these motions to generate corresponding captions.
Furthermore, we propose verb-variation paraphrasing to increase the caption
variety and learn the link between primitive motions and high-level verbs. With
this, we are able to learn a motion-focused video-language representation.
Experiments demonstrate our approach is effective for a variety of downstream
tasks, particularly when limited data is available for fine-tuning. Code is
available: https://hazeldoughty.github.io/Papers/LocoMotion/
comment: ACCV 2024 Oral
♻ ☆ Reconfidencing LLMs from the Grouping Loss Perspective EMNLP 2024
Large Language Models (LLMs), including ChatGPT and LLaMA, are susceptible to
generating hallucinated answers in a confident tone. While efforts to elicit
and calibrate confidence scores have proven useful, recent findings show that
controlling uncertainty must go beyond calibration: predicted scores may
deviate significantly from the actual posterior probabilities due to the impact
of grouping loss. In this work, we construct a new evaluation dataset derived
from a knowledge base to assess confidence scores given to answers of Mistral
and LLaMA. Experiments show that they tend to be overconfident. Further, we
show that they are more overconfident on some answers than others, e.g.,
depending on the nationality of the person in the query. In
uncertainty-quantification theory, this is grouping loss. To address this, we
propose a solution to reconfidence LLMs, canceling not only calibration but
also grouping loss. The LLMs, after the reconfidencing process, indicate
improved confidence alignment with the accuracy of their responses.
comment: EMNLP 2024 Findings
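Grouping loss can be shown in miniature: a model may be perfectly calibrated on average while specific subgroups (e.g., by nationality in the query) remain over- or under-confident. The sketch below computes per-group confidence-minus-accuracy gaps as a minimal illustration; it is not the paper's reconfidencing procedure.

```python
import numpy as np

def groupwise_gap(conf, correct, groups):
    """Per-group (mean confidence - accuracy) gap. Average calibration can
    hide grouping loss: the gaps can be nonzero per group even when the
    overall gap is exactly zero."""
    gaps = {}
    for g in set(groups):
        m = np.array(groups) == g
        gaps[g] = conf[m].mean() - correct[m].mean()
    return gaps

conf = np.array([0.75, 0.75, 0.75, 0.75])   # uniform reported confidence
correct = np.array([1.0, 1.0, 1.0, 0.0])    # group A: acc 1.0, group B: acc 0.5
groups = ["A", "A", "B", "B"]
gaps = groupwise_gap(conf, correct, groups)
# Overall: mean confidence 0.75 equals overall accuracy 0.75, yet group A is
# under-confident (gap -0.25) and group B over-confident (gap +0.25).
```

Reconfidencing aims to drive these per-group gaps to zero, not just the aggregate one.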
♻ ☆ TravelPlanner: A Benchmark for Real-World Planning with Language Agents ICML 2024
Planning has been part of the core pursuit for artificial intelligence since
its conception, but earlier AI agents mostly focused on constrained settings
because many of the cognitive substrates necessary for human-level planning
have been lacking. Recently, language agents powered by large language models
(LLMs) have shown interesting capabilities such as tool use and reasoning. Are
these language agents capable of planning in more complex settings that are out
of the reach of prior AI agents? To advance this investigation, we propose
TravelPlanner, a new planning benchmark that focuses on travel planning, a
common real-world planning scenario. It provides a rich sandbox environment,
various tools for accessing nearly four million data records, and 1,225
meticulously curated planning intents and reference plans. Comprehensive
evaluations show that the current language agents are not yet capable of
handling such complex planning tasks; even GPT-4 achieves a success rate of
only 0.6%. Language agents struggle to stay on task, use the right tools to collect
information, or keep track of multiple constraints. However, we note that the
mere possibility for language agents to tackle such a complex problem is in
itself non-trivial progress. TravelPlanner provides a challenging yet
meaningful testbed for future language agents.
comment: ICML 2024 (Spotlight)
♻ ☆ Trends in Integration of Knowledge and Large Language Models: A Survey and Taxonomy of Methods, Benchmarks, and Applications
Zhangyin Feng, Weitao Ma, Weijiang Yu, Lei Huang, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu
Large language models (LLMs) exhibit superior performance on various natural
language tasks, but they are susceptible to issues stemming from outdated data
and domain-specific limitations. In order to address these challenges,
researchers have pursued two primary strategies, knowledge editing and
retrieval augmentation, to enhance LLMs by incorporating external information
from different aspects. Nevertheless, there is still a notable absence of a
comprehensive survey. In this paper, we review the trends in the integration of
knowledge and large language models, covering a taxonomy of methods,
benchmarks, and applications. In addition, we conduct an in-depth analysis of
different methods and point out potential future research directions. We hope
this survey offers the community quick access and a
comprehensive overview of this research area, with the intention of inspiring
future research endeavors.
comment: Work in progress; 22 pages. This work has been submitted to the IEEE
for possible publication
♻ ☆ Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Prompt tuning is an efficient solution for training large language models
(LLMs). However, current soft-prompt-based methods often sacrifice multi-task
modularity, requiring the training process to be fully or partially repeated
for each newly added task. While recent work on task vectors applied arithmetic
operations on full model weights to achieve the desired multi-task performance,
a similar approach for soft-prompts is still missing. To this end, we introduce
Task Prompt Vectors, created by taking the element-wise difference between the
weights of tuned soft-prompts and their random initialization. Experimental
results on 12
NLU datasets show that task prompt vectors can be used in low-resource settings
to effectively initialize prompt tuning on similar tasks. In addition, we show
that task prompt vectors are independent of the random initialization of prompt
tuning on 2 different language model architectures. This allows prompt
arithmetics with the pre-trained vectors from different tasks. In this way, we
provide a competitive alternative to state-of-the-art baselines by arithmetic
addition of task prompt vectors from multiple tasks.
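The core construction is a one-liner: subtract the random initialization from the tuned soft-prompt, then add (sums of) such vectors back onto a fresh initialization for a new task. The sketch below shows that arithmetic; the helper names and the uniform combination weights are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def task_prompt_vector(tuned, init):
    """A task prompt vector as described: the element-wise difference between
    a tuned soft-prompt and its random initialization."""
    return tuned - init

def apply_vectors(init, vectors, weights=None):
    """Initialize prompt tuning for a new task by adding a (weighted) sum of
    task prompt vectors onto a random initialization."""
    weights = weights or [1.0] * len(vectors)
    return init + sum(w * v for w, v in zip(weights, vectors))

rng = np.random.default_rng(0)
init = rng.normal(size=(20, 768))    # 20 soft-prompt tokens, hidden size 768
tuned_a = init + 0.1                 # stand-ins for prompts tuned on task A
tuned_b = init - 0.3                 # ... and on task B
vec_a = task_prompt_vector(tuned_a, init)
vec_b = task_prompt_vector(tuned_b, init)
combined = apply_vectors(init, [vec_a, vec_b])
```

Because the vectors are differences, they can be mixed across tasks by simple addition, which is what enables the multi-task prompt arithmetic.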
♻ ☆ Let Me Teach You: Pedagogical Foundations of Feedback for Language Models EMNLP 2024
Natural Language Feedback (NLF) is an increasingly popular mechanism for
aligning Large Language Models (LLMs) to human preferences. Despite the
diversity of the information it can convey, NLF methods are often hand-designed
and arbitrary, with little systematic grounding. At the same time, research in
learning sciences has long established several effective feedback models. In
this opinion piece, we compile ideas from pedagogy to introduce FELT, a
feedback framework for LLMs that outlines various characteristics of the
feedback space, and a feedback content taxonomy based on these variables,
providing a general mapping of the feedback space. In addition to streamlining
NLF designs, FELT also brings out new, unexplored directions for research in
NLF. We make our taxonomy available to the community, providing guides and
examples for mapping our categorizations to future research.
comment: EMNLP 2024; 9 pages, 3 figures
♻ ☆ CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation
Protein structures are important for understanding their functions and
interactions. Currently, many protein structure prediction methods are
enriching the structure database. Discriminating the origin of structures is
crucial for distinguishing between experimentally resolved and computationally
predicted structures, evaluating the reliability of prediction methods, and
guiding downstream biological studies. Building on prior work in structure
prediction, we developed a structure-sensitive supervised deep learning model,
Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), to represent
and discriminate the origin of protein structures. CPE-Pro learns the
structural information of proteins and captures inter-structural differences to
achieve accurate traceability across four data classes, and is expected to
extend to more. In parallel, we utilized Foldseek to encode protein
structures into "structure-sequences" and trained a protein Structural Sequence
Language Model, SSLM. Preliminary experiments demonstrated that, compared to
large-scale protein language models pre-trained on vast amounts of amino acid
sequences, the "structure-sequence" enables the language model to learn more
informative protein features, enhancing and optimizing structural
representations. We have provided the code, model weights, and all related
materials on https://github.com/GouWenrui/CPE-Pro-main.git.
♻ ☆ Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, Jinyoung Yeo, Youngjae Yu
Recent advancements in Large Language Models (LLMs) have led to their
adaptation in various domains as conversational agents. We wonder: can
personality tests be applied to these agents to analyze their behavior, similar
to humans? We introduce TRAIT, a new benchmark consisting of 8K multi-choice
questions designed to assess the personality of LLMs. TRAIT is built on two
psychometrically validated small human questionnaires, Big Five Inventory (BFI)
and Short Dark Triad (SD-3), expanded with the ATOMIC-10X knowledge graph to
cover a variety of real-world scenarios. TRAIT also outperforms existing personality
tests for LLMs in terms of reliability and validity, achieving the highest
scores across four key metrics: Content Validity, Internal Validity, Refusal
Rate, and Reliability. Using TRAIT, we reveal two notable insights into
personalities of LLMs: 1) LLMs exhibit distinct and consistent personality,
which is highly influenced by their training data (e.g., data used for
alignment tuning), and 2) current prompting techniques have limited
effectiveness in eliciting certain traits, such as high psychopathy or low
conscientiousness, suggesting the need for further research in this direction.
comment: Preprint; Under review
♻ ☆ Attribute or Abstain: Large Language Models as Long Document Assistants EMNLP 2024
LLMs can help humans working with long documents, but are known to
hallucinate. Attribution can increase trust in LLM responses: The LLM provides
evidence that supports its response, which enhances verifiability. Existing
approaches to attribution have only been evaluated in RAG settings, where the
initial retrieval confounds LLM performance. This is crucially different from
the long document setting, where retrieval is not needed, but could help. Thus,
a long document specific evaluation of attribution is missing. To fill this
gap, we present LAB, a benchmark of 6 diverse long document tasks with
attribution, and experiments with different approaches to attribution on 5 LLMs
of different sizes.
We find that citation, i.e. response generation and evidence extraction in
one step, performs best for large and fine-tuned models, while additional
retrieval can help for small, prompted models. We investigate whether the
"Lost in the Middle" phenomenon exists for attribution, but do not find this. We
also find that evidence quality can predict response quality on datasets with
simple responses, but not so for complex responses, as models struggle with
providing evidence for complex claims.
comment: Accepted at EMNLP 2024. Code and data:
https://github.com/UKPLab/arxiv2024-attribute-or-abstain
♻ ☆ I've Got 99 Problems But FLOPS Ain't One
Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, Costin Raiciu
Hyperscalers dominate the landscape of large network deployments, yet they
rarely share data or insights about the challenges they face. In light of this
supremacy, what problems can we find to solve in this space? We take an
unconventional approach to find relevant research directions, starting from
public plans to build a $100 billion datacenter for machine learning
applications. Leveraging language model scaling laws, we discover what
workloads such a datacenter might carry and explore the challenges one may
encounter in doing so, with a focus on networking research. We conclude that
building the datacenter and training such models is technically possible, but
this requires novel wide-area transports for inter-DC communication, a
multipath transport and novel datacenter topologies for intra-datacenter
communication, and high-speed scale-up networks and transports, outlining a rich
research agenda for the networking community.
♻ ☆ Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions EMNLP 2024
Understanding the inner workings of large language models (LLMs) is crucial
for advancing their theoretical foundations and real-world applications. While
the attention mechanism and multi-layer perceptrons (MLPs) have been studied
independently, their interactions remain largely unexplored. This study
investigates how attention heads and next-token neurons interact in LLMs to
predict new words. We propose a methodology to identify next-token neurons,
find prompts that highly activate them, and determine the upstream attention
heads responsible. We then generate and evaluate explanations for the activity
of these attention heads in an automated manner. Our findings reveal that some
attention heads recognize specific contexts relevant to predicting a token and
activate a downstream token-predicting neuron accordingly. This mechanism
provides a deeper understanding of how attention heads work with MLP neurons to
perform next-token prediction. Our approach offers a foundation for further
research into the intricate workings of LLMs and their impact on text
generation and understanding.
comment: Accepted to EMNLP 2024 Main Conference
♻ ☆ Few-Shot Adversarial Prompt Learning on Vision-Language Models NeurIPS 2024
The vulnerability of deep neural networks to imperceptible adversarial
perturbations has attracted widespread attention. Inspired by the success of
vision-language foundation models, previous efforts achieved zero-shot
adversarial robustness by aligning adversarial visual features with text
supervision. However, in practice, they are still unsatisfactory due to several
issues, including heavy adaptation cost, suboptimal text supervision, and
uncontrolled natural generalization capacity. In this paper, to address these
issues, we propose a few-shot adversarial prompt framework in which adapting
input sequences with limited data yields significant adversarial robustness
improvements. Specifically, we achieve this by providing adversarially
correlated text supervision that is end-to-end learned from adversarial
examples. We also propose a novel training objective that enhances the
consistency of multi-modal features while encouraging differentiated uni-modal
features between natural and adversarial examples. The proposed framework
enables learning adversarial text supervision, which provides superior
cross-modal adversarial alignment and matches state-of-the-art zero-shot
adversarial robustness with only 1% training data. Code is available at:
https://github.com/lionel-w2/FAP.
comment: NeurIPS 2024
♻ ☆ Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Current Large Language Models (LLMs) are predominantly designed with English
as the primary language, and even the few that are multilingual tend to exhibit
strong English-centric biases. Much like speakers who might produce awkward
expressions when learning a second language, LLMs often generate unnatural
outputs in non-English languages, reflecting English-centric patterns in both
vocabulary and grammar. Despite the importance of this issue, the naturalness
of multilingual LLM outputs has received limited attention. In this paper, we
address this gap by introducing novel automatic corpus-level metrics to assess
the lexical and syntactic naturalness of LLM outputs in a multilingual context.
Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark
in French and Chinese, revealing a tendency towards English-influenced
patterns. To mitigate this issue, we also propose a simple and effective
alignment method to improve the naturalness of an LLM in a target language and
domain, achieving consistent improvements in naturalness without compromising
the performance on general-purpose benchmarks. Our work highlights the
importance of developing multilingual metrics, resources and methods for the
new wave of multilingual LLMs.
♻ ☆ RaTEScore: A Metric for Radiology Report Generation EMNLP 2024
This paper introduces a novel, entity-aware metric, termed as Radiological
Report (Text) Evaluation (RaTEScore), to assess the quality of medical reports
generated by AI models. RaTEScore emphasizes crucial medical entities such as
diagnostic outcomes and anatomical details, and is robust against complex
medical synonyms and sensitive to negation expressions. Technically, we
developed a comprehensive medical NER dataset, RaTE-NER, and trained an NER
model specifically for this purpose. This model enables the decomposition of
complex radiological reports into constituent medical entities. The metric
itself is derived by comparing the similarity of entity embeddings, obtained
from a language model, based on their types and relevance to clinical
significance. Our evaluations demonstrate that RaTEScore aligns more closely
with human preference than existing metrics, validated both on established
public benchmarks and our newly proposed RaTE-Eval benchmark.
comment: EMNLP 2024
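The core comparison step, matching entity embeddings by type, can be sketched as below. This is a toy stand-in: the real RaTEScore also weights entities by clinical significance and handles negation and synonyms, which are omitted here.

```python
import math

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm

def entity_match_score(pred_entities, ref_entities):
    """Toy entity-aware comparison: each predicted entity (a type plus an
    embedding) is matched to the most similar reference entity of the same
    type, and the similarities are averaged."""
    sims = []
    for p_type, p_emb in pred_entities:
        same_type = [emb for t, emb in ref_entities if t == p_type]
        sims.append(max((cosine(p_emb, emb) for emb in same_type), default=0.0))
    return sum(sims) / len(sims) if sims else 0.0

pred = [("finding", [1.0, 0.0]), ("anatomy", [0.0, 1.0])]
ref = [("finding", [1.0, 0.1]), ("anatomy", [0.0, 1.0])]
print(round(entity_match_score(pred, ref), 3))  # 0.998
```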
♻ ☆ Can Language Models Induce Grammatical Knowledge from Indirect Evidence? EMNLP 2024
What kinds of data, and how much, are necessary for language models to induce
grammatical knowledge sufficient to judge sentence acceptability? Recent
language models still have much room for improvement in data efficiency
compared to humans. This paper investigates whether language models make
efficient use of indirect data (indirect evidence) from which they can infer
sentence acceptability. Humans, by contrast, use indirect evidence efficiently,
which is considered one of the inductive biases contributing to efficient language
acquisition. To explore this question, we introduce the Wug InDirect Evidence
Test (WIDET), a dataset consisting of training instances inserted into the
pre-training data and evaluation instances. We inject synthetic instances with
newly coined wug words into pretraining data and explore the model's behavior
on evaluation data that assesses grammatical acceptability regarding those
words. We prepare the injected instances by varying their levels of
indirectness and quantity. Our experiments surprisingly show that, for certain
linguistic phenomena, language models do not induce grammatical knowledge even
after repeated exposure to instances that share the structure of the evaluation
instances and differ from them only in lexical items. Our findings suggest a
potential direction for future research: developing models that use latent
indirect evidence to induce grammatical knowledge.
comment: This paper is accepted at EMNLP 2024 Main
♻ ☆ A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning
Tool use, planning, and feedback learning are currently three prominent
paradigms for developing Large Language Model (LLM)-based agents across various
tasks. Although numerous frameworks have been devised for each paradigm, their
intricate workflows and inconsistent taxonomy create challenges in
understanding and reviewing the frameworks across different paradigms. This
survey introduces a unified taxonomy to systematically review and discuss these
frameworks. Specifically, 1) the taxonomy defines environments/tasks, common
LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models),
and universally applicable workflows found in prior work, and 2) it enables a
comparison of key perspectives on the implementations of LMPRs and workflow
designs across different agent paradigms and frameworks. 3) Finally, we
identify three limitations in existing workflow designs and systematically
discuss directions for future work.
comment: Under Review
♻ ☆ GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model
Retrieval-Augmented Generation (RAG) systems are widely used across various
industries for querying closed-domain and in-house knowledge bases. However,
evaluating these systems presents significant challenges due to the private
nature of closed-domain data and a scarcity of queries with verifiable ground
truths. Moreover, there is a lack of analytical methods to diagnose problematic
modules and identify types of failure, such as those caused by knowledge
deficits or issues with robustness. To address these challenges, we introduce
GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation
framework comprising a grounded data generation process and an evaluation
protocol that effectively pinpoints defective modules. Our validation
experiments reveal that GRAMMAR provides a reliable approach for identifying
vulnerable modules and supports hypothesis testing for textual form
vulnerabilities. An open-source tool accompanying this framework is available
in our GitHub repository (see https://github.com/xinzhel/grammar), allowing for
easy reproduction of our results and enabling reliable and modular evaluation
in closed-domain settings.
comment: Under Review
♻ ☆ 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and
BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs
in terms of speed and energy consumption. These developments also enable local
LLM deployment across a broad range of devices. In this work, we introduce
bitnet.cpp, a tailored software stack designed to unlock the full potential of
1-bit LLMs. Specifically, we develop a set of kernels to support fast and
lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments
demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x
to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model
sizes. The code is available at https://github.com/microsoft/BitNet.
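The source of the speedup is the ternary weight format: with weights restricted to {-1, 0, +1}, every multiply in a matrix-vector product collapses into an add, a subtract, or a skip. A pure-Python illustration of the arithmetic (bitnet.cpp's actual kernels are optimized, bit-packed CPU code; this sketch only shows the idea):

```python
def ternary_matvec(W, x):
    """Multiply a ternary weight matrix W (entries in {-1, 0, +1}) by a
    vector x using only additions and subtractions -- the trick behind
    cheap 1.58-bit inference: no multiplications at all."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract it
            # 0 weight: skip entirely (sparsity for free)
        out.append(acc)
    return out

W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, 1.5]
print(ternary_matvec(W, x))  # [-1.0, 3.0]
```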
♻ ☆ From Keywords to Structured Summaries: Streamlining Scholarly Information Access ISWC 2024
This paper highlights the growing importance of information retrieval (IR)
engines in the scientific community, addressing the inefficiency of traditional
keyword-based search engines due to the rising volume of publications. The
proposed solution involves structured records, underpinning advanced
information technology (IT) tools, including visualization dashboards, to
revolutionize how researchers access and filter articles, replacing the
traditional text-heavy approach. This vision is exemplified through a proof of
concept centered on the "reproductive number estimate of infectious diseases"
research theme, using a fine-tuned large language model (LLM) to automate the
creation of structured records to populate a backend database that now goes
beyond keywords. The result is a next-generation information access system,
accessible as an IR method at https://orkg.org/usecases/r0-estimates.
comment: 8 pages, 3 figures | Accepted for publication as a poster paper at
the International Semantic Web Conference (ISWC 2024)
♻ ☆ Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs EMNLP2024
Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen
Improving the performance of large language models (LLMs) in complex
question-answering (QA) scenarios has always been a research focal point.
Recent studies have attempted to enhance LLMs' performance by combining
step-wise planning with external retrieval. While effective for advanced models
like GPT-3.5, smaller LLMs face challenges in decomposing complex questions,
necessitating supervised fine-tuning. Previous work has relied on manual
annotation and knowledge distillation from teacher LLMs, which are
time-consuming and not accurate enough. In this paper, we introduce a novel
framework for enhancing LLMs' planning capabilities by using planning data
derived from knowledge graphs (KGs). LLMs fine-tuned with this data have
improved planning capabilities, better equipping them to handle complex QA
tasks that involve retrieval. Evaluations on multiple datasets, including our
newly proposed benchmark, highlight the effectiveness of our framework and the
benefits of KG-derived planning data.
comment: EMNLP2024 Findings
♻ ☆ Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach
In this paper, we study the problem of uncertainty estimation and calibration
for LLMs. We begin by formulating the uncertainty estimation problem, a
relevant yet underexplored area in existing literature. We then propose a
supervised approach that leverages labeled datasets to estimate the uncertainty
in LLMs' responses. Based on the formulation, we illustrate the difference
between the uncertainty estimation for LLMs and that for standard ML models and
explain why the hidden neurons of the LLMs may contain uncertainty information.
Our designed approach demonstrates the benefits of utilizing hidden activations
to enhance uncertainty estimation across various tasks and shows robust
transferability in out-of-distribution settings. We distinguish the uncertainty
estimation task from the uncertainty calibration task and show that better
uncertainty estimation leads to better calibration performance. Furthermore,
our method is easy to implement and adaptable to different levels of model
accessibility including black box, grey box, and white box.
comment: 29 pages, 14 figures
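A minimal sketch of the supervised idea, assuming a logistic probe trained on hidden activations labeled with answer correctness (the paper's actual features and estimator may differ):

```python
import math, random

def train_probe(acts, labels, lr=0.1, epochs=200):
    """Logistic-regression probe mapping hidden activations to the
    probability that the LLM's answer is correct -- a minimal stand-in for
    a supervised uncertainty estimator (hyperparameters are assumptions)."""
    d = len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(correct)
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def confidence(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# toy data: the first hidden dimension carries the "correctness" signal
random.seed(0)
labels = [1, 0] * 20
acts = [[random.gauss(1.0 if y else -1.0, 0.5), random.gauss(0.0, 1.0)]
        for y in labels]
w, b = train_probe(acts, labels)
print(confidence(w, b, [2.0, 0.0]) > confidence(w, b, [-2.0, 0.0]))  # True
```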
♻ ☆ Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs NeurIPS 2024
Reward models trained on human preference data have been proven to
effectively align Large Language Models (LLMs) with human intent within the
framework of reinforcement learning from human feedback (RLHF). However,
current reward models have limited generalization capabilities to unseen
prompts and responses, which can lead to an unexpected phenomenon known as
reward over-optimization, resulting in a decline in actual performance due to
excessive optimization of rewards. While previous research has advocated for
constraining policy optimization, our study introduces a novel approach to
enhance the reward model's generalization ability against distribution shifts
by regularizing the hidden states. Specifically, we retain the base model's
language model head and incorporate a suite of text-generation losses to
preserve the hidden states' text-generation capabilities, while concurrently
learning a reward head on top of the same hidden states. Our experimental results
demonstrate that the introduced regularization technique markedly improves the
accuracy of learned reward models across a variety of out-of-distribution (OOD)
tasks and effectively alleviates the over-optimization issue in RLHF, offering
a more reliable and robust preference learning paradigm.
comment: NeurIPS 2024
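The objective described above can be sketched as the standard Bradley-Terry preference loss plus a weighted language-modeling term computed through the retained LM head. The combination form and the value of the weight `lam` are assumptions, not the paper's exact recipe:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Standard Bradley-Terry reward-modeling loss: -log sigmoid(r_c - r_r)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def regularized_rm_loss(r_chosen, r_rejected, lm_loss, lam=0.1):
    """Total objective sketched from the abstract: the usual preference loss
    plus a text-generation (language-modeling) loss from the retained LM
    head, weighted by lam (an assumed hyperparameter)."""
    return preference_loss(r_chosen, r_rejected) + lam * lm_loss

# chosen response scored 2.0, rejected 0.5, LM loss on the same batch 3.2
print(round(regularized_rm_loss(2.0, 0.5, lm_loss=3.2), 4))  # 0.5214
```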
♻ ☆ Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models NeurIPS 2024
Large language models (LLMs) have exhibited impressive performance in
language comprehension and various reasoning tasks. However, their abilities in
spatial reasoning, a crucial aspect of human cognition, remain relatively
unexplored. Humans possess a remarkable ability to create mental images of
unseen objects and actions through a process known as the Mind's Eye, enabling
the imagination of the unseen world. Inspired by this cognitive capacity, we
propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial
reasoning of LLMs by visualizing their reasoning traces, thereby guiding
subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning
tasks, including natural language navigation, visual navigation, and visual
tiling in 2D grid worlds. Experimental results demonstrated that VoT
significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT
outperformed existing multimodal large language models (MLLMs) in these tasks.
VoT works surprisingly well on LLMs, and its ability to generate mental images
to facilitate spatial reasoning resembles the mind's eye process, suggesting
its potential viability in MLLMs. Please find the dataset and code
at https://microsoft.github.io/visualization-of-thought
comment: 38th Conference on Neural Information Processing Systems (NeurIPS
2024)
♻ ☆ GPT-SW3: An Autoregressive Language Model for the Nordic Languages
Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren
This paper details the process of developing the first native large
generative language model for the Nordic languages, GPT-SW3. We cover all parts
of the development process, from data collection and processing, training
configuration and instruction finetuning, to evaluation and considerations for
release strategies. We hope that this paper can serve as a guide and reference
for other researchers who undertake the development of large generative models
for smaller languages.
♻ ☆ Non-myopic Generation of Language Model for Reasoning and Planning
Large Language Models have demonstrated remarkable abilities in reasoning and
planning by breaking down complex problems into sequential steps. Despite their
success in various domains like mathematical problem-solving and coding, LLMs
face challenges in ensuring reliable and optimal planning due to their inherent
myopic nature of autoregressive decoding. This paper revisits LLM reasoning
from an optimal-control perspective, proposing a novel method,
Predictive-Decoding, that leverages Model Predictive Control to enhance
planning accuracy. By re-weighting LLM distributions based on foresight
trajectories, Predictive-Decoding aims to mitigate early errors and promote
non-myopic planning. Our experiments show significant improvements in a wide
range of tasks for math, coding, and agents. Furthermore, Predictive-Decoding
demonstrates computational efficiency, outperforming search baselines with
reduced computational resources. This study provides insights into optimizing
LLM planning capabilities.
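The re-weighting step can be sketched roughly as follows; the scoring and normalization are assumptions in the spirit of Model Predictive Control, not the paper's exact formulation. Each candidate next step is evaluated by a foresight rollout, and the base next-step distribution is re-weighted by the exponentiated rollout score:

```python
import math

def predictive_decode(candidates, base_logprobs, rollout_score, beta=1.0):
    """MPC-flavored re-weighting sketch: combine the model's base log-prob
    for each candidate with a foresight rollout score, then renormalize and
    return the highest-probability candidate."""
    weights = [math.exp(lp + beta * rollout_score(c))
               for c, lp in zip(candidates, base_logprobs)]
    z = sum(weights)                         # renormalize
    probs = [w / z for w in weights]
    return max(zip(candidates, probs), key=lambda t: t[1])

# toy example: "a" is the greedy choice now, but foresight favors "b"
best, prob = predictive_decode(["a", "b"],
                               [math.log(0.7), math.log(0.3)],
                               {"a": 0.0, "b": 2.0}.get)
print(best)  # b
```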
♻ ☆ Generative AI Security: Challenges and Countermeasures
Generative AI's expanding footprint across numerous industries has led to
both excitement and increased scrutiny. This paper delves into the unique
security challenges posed by Generative AI, and outlines potential research
directions for managing these risks.
♻ ☆ OpenMU: Your Swiss Army Knife for Music Understanding
Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji
We present OpenMU-Bench, a large-scale benchmark suite for addressing the
data scarcity issue in training multimodal language models to understand music.
To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new
annotations. OpenMU-Bench also broadens the scope of music understanding by
including lyrics understanding and music tool usage. Using OpenMU-Bench, we
trained our music understanding model, OpenMU, with extensive ablations,
demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both
OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music
understanding and to enhance creative music production efficiency.
comment: Resources: https://github.com/mzhaojp22/openmu
♻ ☆ Reinforcement Learning with Dynamic Multi-Reward Weighting for Multi-Style Controllable Generation
Textual style expresses a diverse set of information, including interpersonal
dynamics (e.g., formality) and the author's emotions or attitudes (e.g.,
disgust). An open question is how language models can be explicitly controlled
so that they weave together target styles when generating text: for example, to
produce text that is both negative and non-toxic. One approach to such
controlled generation is multi-objective reinforcement learning (RL), but how
best to combine multiple objectives in a reward function is an open question.
In this paper, we investigate various formulations of multi-style rewards,
including calibrated outputs from discriminators and dynamic weighting by
discriminator gradient magnitudes. We find that our proposed dynamic weighting
outperforms static weighting approaches with respect to style control while
maintaining linguistic quality, and we explore its effectiveness in 2- and
3-style control.
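Dynamic weighting by discriminator gradient magnitudes can be sketched as below; the exact normalization is an assumption. The intuition is that styles whose discriminators are currently most sensitive receive more weight in the scalar RL reward:

```python
def dynamic_reward(style_rewards, grad_norms, eps=1e-8):
    """Combine per-style rewards with weights proportional to each style
    discriminator's current gradient magnitude -- a sketch of the dynamic
    weighting idea, with an assumed sum-to-one normalization."""
    total = sum(grad_norms) + eps            # avoid division by zero
    weights = [g / total for g in grad_norms]
    return sum(w * r for w, r in zip(weights, style_rewards))

# two styles, e.g. negativity and non-toxicity: the second discriminator's
# larger gradient pulls the combined reward toward its signal
print(round(dynamic_reward([0.9, 0.4], [0.2, 0.8]), 6))  # 0.5
```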
♻ ☆ LLMScan: Causal Scan for LLM Misbehavior Detection
Despite the success of Large Language Models (LLMs) across various fields,
their potential to generate untruthful, biased and harmful responses poses
significant risks, particularly in critical applications. This highlights the
urgent need for systematic methods to detect and prevent such misbehavior.
While existing approaches target specific issues such as harmful responses,
this work introduces LLMScan, an innovative LLM monitoring technique based on
causality analysis, offering a comprehensive solution. LLMScan systematically
monitors the inner workings of an LLM through the lens of causal inference,
operating on the premise that the LLM's 'brain' behaves differently when
misbehaving. By analyzing the causal contributions of the LLM's input tokens
and transformer layers, LLMScan effectively detects misbehavior. Extensive
experiments across various tasks and models reveal clear distinctions in the
causal distributions between normal behavior and misbehavior, enabling the
development of accurate, lightweight detectors for a variety of misbehavior
detection tasks.
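A generic intervention-based sketch of the kind of causal analysis the abstract describes (not the authors' exact estimator): mask one input token at a time, re-score the output, and treat the score drop as the token's causal contribution.

```python
def token_causal_effects(tokens, model_score, mask="[MASK]"):
    """Estimate each input token's causal contribution by intervention:
    replace the token with a mask, re-score, and record the change in the
    model's score relative to the unablated input."""
    base = model_score(tokens)
    effects = {}
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + [mask] + tokens[i + 1:]
        effects[tok] = base - model_score(ablated)   # drop = contribution
    return effects

# toy scorer that only responds to the word "never"
scorer = lambda toks: 1.0 if "never" in toks else 0.2
fx = token_causal_effects(["I", "never", "lie"], scorer)
print(fx["never"] > fx["I"])  # True
```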
♻ ☆ BrainTransformers: SNN-LLM
This study introduces BrainTransformers, an innovative Large Language Model
(LLM) implemented using Spiking Neural Networks (SNN). Our key contributions
include: (1) designing SNN-compatible Transformer components such as SNNMatmul,
SNNSoftmax, and SNNSiLU; (2) implementing an SNN approximation of the SiLU
activation function; and (3) developing a Synapsis module to simulate synaptic
plasticity. Our 3-billion parameter model, BrainTransformers-3B-Chat,
demonstrates competitive performance across various benchmarks, including MMLU
(63.2), BBH (54.1), ARC-C (54.3), and GSM8K (76.3), while potentially offering
improved energy efficiency and biological plausibility. The model employs a
three-stage training approach, including SNN-specific neuronal synaptic
plasticity training. This research opens new avenues for brain-like AI systems
in natural language processing and neuromorphic computing. Future work will
focus on hardware optimization, developing specialized SNN fine-tuning tools,
and exploring practical applications in energy-efficient computing
environments.
♻ ☆ TSDS: Data Selection for Task-Specific Model Finetuning
Finetuning foundation models for specific tasks is an emerging paradigm in
modern machine learning. The efficacy of task-specific finetuning largely
depends on the selection of appropriate training data. We present TSDS
(Task-Specific Data Selection), a framework to select data for task-specific
model finetuning, guided by a small but representative set of examples from the
target task. To do so, we formulate data selection for task-specific finetuning
as an optimization problem with a distribution alignment loss based on optimal
transport to capture the discrepancy between the selected data and the target
distribution. In addition, we add a regularizer to encourage the diversity of
the selected data and incorporate kernel density estimation into the
regularizer to reduce the negative effects of near-duplicates among the
candidate data. We connect our optimization problem to nearest neighbor search
and design efficient algorithms to compute the optimal solution based on
approximate nearest neighbor search techniques. We evaluate our method on data
selection for both continued pretraining and instruction tuning of language
models. We show that instruction tuning using data selected by our method with
a 1% selection ratio often outperforms using the full dataset and beats the
baseline selection methods by 1.5 points in F1 score on average.
comment: 31 pages, 1 figure
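A greedy toy sketch in the spirit of the abstract: prefer candidates close to the target examples (a crude stand-in for the optimal-transport alignment term) while a kernel-density penalty over already-selected points discourages near-duplicates. The scoring rule, bandwidth, and penalty weight are all assumptions, not the paper's algorithm:

```python
import math

def select_data(candidates, targets, k, bandwidth=1.0, lam=2.0):
    """Greedily select k points: minimize distance-to-target (alignment)
    plus a Gaussian kernel-density penalty over already-selected points
    (diversity / de-duplication)."""
    def kde(x, pts):  # kernel density over the selected set
        return sum(math.exp(-math.dist(x, p) ** 2 / bandwidth) for p in pts)
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best = min(pool, key=lambda x: min(math.dist(x, t) for t in targets)
                   + lam * kde(x, selected))
        selected.append(best)
        pool.remove(best)
    return selected

targets = [(0.0, 0.0)]
cands = [(0.1, 0.0), (0.11, 0.0), (5.0, 5.0), (0.9, 0.0)]
# picks (0.1, 0.0) first; the near-duplicate (0.11, 0.0) is then penalized
# by the density term, so (0.9, 0.0) is chosen next
print(select_data(cands, targets, k=2))
```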
♻ ☆ Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding EMNLP 2024
Visual arguments, often used in advertising or social causes, rely on images
to persuade viewers to do or believe something. Understanding these arguments
requires selective vision: only specific visual stimuli within an image are
relevant to the argument, and relevance can only be understood within the
context of a broader argumentative structure. While visual arguments are
readily appreciated by human audiences, we ask: are today's AI capable of
similar understanding? We present VisArgs, a dataset of 1,611 images annotated
with 5,112 visual premises (with regions), 5,574 commonsense premises, and
reasoning trees connecting them into structured arguments. We propose three
tasks for evaluating visual argument understanding: premise localization,
premise identification, and conclusion deduction. Experiments show that 1)
machines struggle to capture visual cues: GPT-4o achieved 78.5% accuracy,
while humans reached 98.0%, and models performed 19.5% worse at identifying
irrelevant objects within the image than at identifying objects external to
it; and 2) providing relevant visual premises improved model performance
significantly.
comment: 12 pages, 6 figures. Accepted as main paper in EMNLP 2024
♻ ☆ Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
Although large language models (LLMs) demonstrate impressive proficiency in
various tasks, they present potential safety risks, such as `jailbreaks', where
malicious inputs can coerce LLMs into generating harmful content. To address
these issues, many LLM developers have implemented various safety measures to
align these models. This alignment involves several techniques, including data
filtering during pre-training, supervised fine-tuning, reinforcement learning
from human feedback, and red-teaming exercises. These methods often introduce
deliberate and intentional biases similar to Political Correctness (PC) to
ensure the ethical behavior of LLMs. In this paper, we delve into the
intentional biases injected into LLMs for safety purposes and examine methods
to circumvent these safety alignment techniques. Notably, these intentional
biases result in a jailbreaking success rate in GPT-4o models that differs by
20% between non-binary and cisgender keywords and by 16% between white and
black keywords, even when the other parts of the prompts are identical. We
introduce the concept of PCJailbreak, highlighting the inherent risks posed by
these safety-induced biases. Additionally, we propose an efficient defense
method PCDefense, which prevents jailbreak attempts by injecting defense
prompts prior to generation. PCDefense stands as an appealing alternative to
Guard Models, such as Llama-Guard, that require additional inference cost after
text generation. Our findings emphasize the urgent need for LLM developers to
adopt a more responsible approach when designing and implementing safety
measures.
♻ ☆ Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S EMNLP 2024
Recent work has highlighted the culturally-contingent nature of commonsense
knowledge. We introduce AMAMMERε, a test set of 525 multiple-choice
questions designed to evaluate the commonsense knowledge of English LLMs,
relative to the cultural contexts of Ghana and the United States. To create
AMAMMERε, we select a set of multiple-choice questions (MCQs) from
existing commonsense datasets and rewrite them in a multi-stage process
involving surveys of Ghanaian and U.S. participants. In three rounds of
surveys, participants from both pools are solicited to (1) write correct and
incorrect answer choices, (2) rate individual answer choices on a 5-point
Likert scale, and (3) select the best answer choice from the newly-constructed
MCQ items, in a final validation step. By engaging participants at multiple
stages, our procedure ensures that participant perspectives are incorporated
both in the creation and validation of test items, resulting in high levels of
agreement within each pool. We evaluate several off-the-shelf English LLMs on
AMAMMERε. Uniformly, models prefer answer choices that align with
the preferences of U.S. annotators over those of Ghanaian annotators.
Additionally, when test items specify a cultural context (Ghana or the U.S.),
models exhibit some ability to adapt, but performance is consistently better
in U.S. contexts than in Ghanaian ones. As large resources are devoted to the
advancement of English LLMs,
our findings underscore the need for culturally adaptable models and
evaluations to meet the needs of diverse English-speaking populations around
the world.
comment: Accepted to EMNLP 2024
♻ ☆ A Bi-consolidating Model for Joint Relational Triple Extraction
Current methods extract relational triples directly by making a prediction
from a possible entity pair in a raw sentence, without relying on entity
recognition. The task suffers from a serious semantic overlapping problem, in
which several relation triples may share one or two entities in a sentence. In
this paper, based on a two-dimensional sentence representation, a
bi-consolidating model is proposed to address this problem by simultaneously
reinforcing the local and global semantic features relevant to a relation
triple. This model consists of a local consolidation component and a global
consolidation component. The first component uses a pixel difference
convolution to enhance the semantic information of a possible triple
representation from adjacent regions and to mitigate noise from neighbouring
regions. The second component strengthens the triple representation with
channel attention and spatial attention, which has the advantage of learning
remote semantic dependencies in a sentence. Both components help improve the
performance of both
entity identification and relation type classification in relation triple
extraction. Evaluated on several public datasets, the bi-consolidating model
achieves competitive performance. Analytical experiments demonstrate the
effectiveness of our model for relational triple extraction and give motivation
for other natural language processing tasks.
♻ ☆ When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers
Recent advancements in the safety of Large Language Models (LLMs) have
primarily focused on mitigating attacks crafted in natural language or in
common encryption techniques like Base64. However, newer models, which often
possess better reasoning capabilities, open the door to attack vectors that
did not exist for older models. This seems counter-intuitive at
first glance, but these advanced models can decipher more complex cryptic
queries that previous models could not, making them susceptible to attacks
using such prompts. To exploit this vulnerability, we propose Attacks using
Custom Encryptions (ACE), a novel method to jailbreak LLMs by leveraging custom
encryption schemes. We evaluate the effectiveness of ACE on four
state-of-the-art LLMs, achieving Attack Success Rates (ASR) of up to 66% on
closed-source models and 88% on open-source models. Building upon this, we
introduce Layered Attacks using Custom Encryptions (LACE), which employs
multiple layers of encryption through our custom ciphers to further enhance the
ASR. Our findings demonstrate that LACE significantly enhances the ability to
jailbreak LLMs, increasing the ASR on GPT-4o from 40% to 78%, a gain of 38
percentage points. Our results highlight that the advanced capabilities of
LLMs introduce unforeseen vulnerabilities to complex attacks. Specifically,
complex and layered ciphers increase the chance of jailbreaking.
comment: 14 pages, 7 figures